(54) Abstract Title: Data compression and recovery

(57) A system for reducing the size of data stored on a 
computer uses checksum algorithms to create a 
data entry in a computer readable medium. In 
operation, first and second checksums are applied 
to a block of data. The results are combined with
details of the algorithms used and at least one 
attribute of the data block to create a data entry. 
The attribute for the block may be a name, size, 
length, hash type amongst others. The data entry 
may be written in a markup language, preferably 
XML or SGML. The checksum values may be a 
hashed value, a digest or a checksum number. The 
checksum value may be generated by MD2, MD4, 
SHA, CRC, RIPE, CRC16, CRC32 or CRC64 
algorithms. In a further embodiment a recovery 
system is provided. In use the data entry is received 
and checksum algorithms are applied to the data 
block, the results being compared to the checksum 
values in the data entry in order to identify 
candidate blocks for recovery. The checksum 
algorithms may be applied to the data blocks by 
either a linear or non linear scan. The non linear 
scan can be either a skipping, modulus or 
exponential scan. 
SYSTEMS AND METHODS FOR DATA COMPRESSION AND DECOMPRESSION

CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Application No. 60/568304, 
filed May 4, 2004 and U.S. Provisional Application No. 60/603,604, filed August 23, 2004. 
The entire contents of the above applications are incorporated herein by reference.

BACKGROUND 

Prior art discloses several branches of compression algorithms. Statistical methods of 
compression relied on reducing the size of frequent patterns in data. Compression was based 
on redundancy of the data and often involved the work of Shannon and others. There were
practical limits to compressing data with redundancy.

Prior art compression is statistical. Compression is based on redundancy 
modeling. The present invention allows checksums to be used to determine whether a 
number in a sequence of numbers matches a given checksum and to make files smaller. 
Currently, files are limited by the amount of redundancy that can be removed. A checksum 
compressor would operate on how unique a number is, as measured by the uniqueness of its
checksum; i.e., there is only one checksum number that can be associated with a given number.

Message digests were created to verify that data is the original data and that no 
changes occurred in transmission. They can be used to ensure that computer files have not 
been modified. There are two main message digests and signature hashes used today, Secure
Hash Algorithm ("SHA") and Message Digest Algorithm No. 5 ("MD5"). MD5 is one of
the first message digest algorithms [1] created and was developed by Rivest-Shamir-Adleman
("RSA"). As is known in the art, MD5 has a problem with collisions, and is not guaranteed
to be as unique as SHA.

SHA was developed by the U.S. government (N.S.A.) to fingerprint computer data.
The government developed SHA as a means of verifying digital information. SHA is a quest
for generating a unique key or hash for a given set of data. One of the primary government



[1] Other message digest algorithms include MD2 (a message-digest hash function optimized for 8-bit
machines), and MD4 (a message-digest hash function optimized for 32-bit machines). 
interests is in tracking data and transmission of data. There are numerous open source
implementations of the SHA algorithm in C and C++ code.

The following is a note from the US government about SHA security; however, it should not be
considered the assertion of the US government:

This document specifies a Secure Hash Algorithm, SHA-1, for 
computing a condensed representation of a message or a data file. When a 
message of any length < 2^64 bits is input, the SHA-1 produces a 160-bit output
called a message digest. The message digest can then, for example, be input to a
signature algorithm which generates or verifies the signature for the message.
Signing the message digest rather than the message often improves the
efficiency of the process because the message digest is usually much smaller in
size than the message. The same hash algorithm must be used by the verifier of
a digital signature as was used by the creator of the digital signature. Any
change to the message in transit will, with very high probability, result in a 
different message digest, and the signature will fail to verify. 

The SHA-1 is called secure because it is computationally infeasible to 
find a message which corresponds to a given message digest, or to find two 
different messages which produce the same message digest. Any change to a
message in transit will, with very high probability, result in a different message 
digest and the signature will fail to verify. 
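
The properties quoted above can be illustrated with a minimal sketch that uses Python's standard hashlib module. The helper name sha1_hex and the sample messages are assumptions made for illustration only; they do not appear in the specification.

```python
import hashlib


def sha1_hex(message: bytes) -> str:
    """Return the SHA-1 message digest of `message` as 40 hex characters."""
    return hashlib.sha1(message).hexdigest()


original = b"abc"
tampered = b"abd"  # hypothetical message with one byte changed in transit

digest = sha1_hex(original)

# SHA-1 always yields 160 bits (20 bytes), whatever the message length.
digest_bits = len(hashlib.sha1(original).digest()) * 8

# A changed message gives, with very high probability, a different digest.
changed = sha1_hex(tampered) != digest
```

The digest length is fixed, so the digest is much smaller than any message longer than 20 bytes, which is the efficiency point made in the quoted note.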

Both SHA and MD5 are known in the art, and further description is not provided
herein.

SUMMARY 

The quest for uniqueness of data through message digests can also be used to improve 
compression. How unique a number is can be used to compress it. A checksum can be
unique or relatively unique with collisions. A unique checksum would generate a unique
output number for each input. A relatively unique checksum would generate numbers that
are the same for different inputs or it would minimize collisions between inputs. 

In one embodiment, the present system and method is designed to reduce the size of
data on a computer through compression. In another embodiment, the present system and
method is designed to improve hash, message digest, and checksum technology and their
application to information and data storage. In another embodiment, the present system and
method is designed to improve uniqueness by using mutual exclusion in hash and checksum
tests. In another embodiment, the present system and method is designed to improve
checksum tests providing better computer security. In another embodiment, the present
system and method is designed to create an XML compression format and to move binary
compression formats to XML or other markup language. In another embodiment, the present
system and method is designed to utilize variable length hashes and message digests. In
another embodiment, the present system and method is designed to create an XML based
checksum that can be used to verify the integrity of files.

In addition to the various embodiments discussed herein, the present system and 
method can also be adapted for encryption and verification uses, improved validation and 
checksums to ensure that a file is unchanged or unmodified, to increase the bandwidths of
data transfers, (e.g., ftp and streaming media), and to replace other data compression methods 
(e.g., mpeg). 

Further advantages of the present system and method include increased compression 
capacity with open formats, decreased file sizes, using document type definitions (DTDs) to
validate compressed files, using XML or other markup language to describe compressed files,
and using checksum tests. Such tests can be used to compress numbers or data, and to
validate a file's integrity. Checksum tests can also be used to verify the integrity of a file
through a combination of digital signatures, moduluses, or checksums. In combination, they
can also be used to compress files if the resulting signature checksum is less than the size of a
block of bytes and has few or no collisions. As known in the art, verifying file integrity
means ensuring that the contents haven't changed. In one embodiment, the present system
and method also provides for multiple ways to verify file integrity. Checksums can compress 
numbers, and by using multiple checksums, file security and integrity can be implemented 
with stronger sets of checks. 

Checksum compression can be used to reduce the size of data, and it could be
commercialized as an alternative to zip and archive compression formats. It would allow for
improved streaming media and audio and video files by increasing the quality to bandwidth
or size ratio. Further, there are many open source message digest programs which increase
its application. Checksum compression can also reduce storage costs and improve
productivity and efficiency.

In an aspect, the present invention is directed to a system for data storage comprising 
one or more processors operable to generate a first checksum value for a data block and a 
second checksum value for the data block, wherein said first checksum value is generated by
applying a first checksum algorithm to said data block and said second checksum value is
generated by applying a second checksum algorithm, different from said first checksum
algorithm, to said data block; one or more processors operable to create a data entry
comprising data identifying: the first and second checksum values, the first and second
checksum algorithms, and at least one of the identified attributes of the data block; and one or
more processors operable to store said data entry in a computer-readable medium.

In various embodiments, the one or more processors operable to generate, the one or 
more processors operable to create, and the one or more processors operable to store may or 
may not be distinct. For example, one processor can be operable to generate, create, and
store. Alternatively, a plurality of processors can be operable to generate, create, and/or 
store. 

In an aspect, the present invention is directed to a system for data storage comprising 
one or more processors operable to identify one or more attributes of a first data block and a
second data block, said second data block comprising and different from said first data block;
one or more processors operable to generate a first checksum value for the first data block,
wherein said first checksum value is generated by applying a first checksum algorithm to said
first data block; one or more processors operable to generate a second checksum value for the
second data block, wherein said second checksum value is generated by applying a second
checksum algorithm to said second data block; one or more processors operable to create a
data entry comprising data identifying the first and second checksum values, and at least one 
of the identified attributes of the first and second data blocks; and one or more processors 
operable to store said data entry in a computer-readable medium. 

In various embodiments, the one or more processors operable to identify, the one or
more processors operable to generate a first checksum value, the one or more processors
operable to generate a second checksum value, the one or more processors operable to create,
and the one or more processors operable to store may or may not be distinct. For example,
one processor can be operable to identify, generate a first checksum value, generate a second
checksum value, create, and store. Alternatively, a plurality of processors can be operable to
identify, generate a first checksum value, generate a second checksum value, create, and/or
store. 
In an aspect of the present invention the system further comprises one or more
processors further operable to determine an attribute for the data block, said attribute being
one of a name, size, length, hash type, checksum type, digest type, padding, floor, ceiling,
modulus, collision, directory, root, drive, path, date, time, modified date, permission, owner,
or byte order; one or more processors operable to create a data entry comprising the attribute;
and one or more processors operable to store said data entry in a computer-readable medium.

In various embodiments, the one or more processors operable to determine, the one or 
more processors operable to create, and the one or more processors operable to store may or 
may not be distinct. For example, one processor can be operable to determine, create, and
store. Alternatively, a plurality of processors can be operable to determine, create, and/or
store. 

In various aspects of the present invention the second checksum algorithm is the first 
checksum algorithm; the attributes are one of a name, size, length, hash type, checksum type, 
digest type, padding, floor, ceiling, modulus, collision, directory, root, drive, path, date, time,
modified date, permission, owner, or byte order; the data entry is written in a markup
language; the markup language is one of either XML or SGML; the one or more checksum
values is at least one of a hashed value, a digest, and a checksum number; the one or more
checksum values is generated using at least one of an MD2 algorithm, an MD4 algorithm, an
MD5 algorithm, an SHA algorithm, a Cyclic Redundant Checksum algorithm, a Ripe
algorithm, a CRC16 checksum algorithm, a CRC32 checksum algorithm, and a CRC64
checksum algorithm; and at least 2 of said one or more processors operate in parallel.

In an aspect, the present invention is directed to a system for data recovery 
comprising one or more processors operable to receive a data entry comprising data 
identifying first and second checksum values, first and second checksum algorithms, and at
least one attribute of a first data block; and based on said data entry, one or more processors
operable to identify said first data block by (a) applying said first checksum
algorithm to each block in a first set of data blocks to generate a first set of checksum values,
each value in said first set of checksum values corresponding to one or more data blocks in
said first set of data blocks, (b) comparing said first set of checksum values to said first
checksum value, and (c) identifying one or more first candidate data blocks as potentially being
said first data block.

In various embodiments, the one or more processors operable to receive and the one
or more processors operable to identify may or may not be distinct. For example, one
processor can be operable to receive and identify. Alternatively, a plurality of processors can
be operable to receive and/or identify. 

In an aspect of the present invention, the system further comprises one or more
processors operable to identify one or more first candidate data blocks as corresponding to
values in said first set of checksum values that are equal to said first checksum value.

In an aspect of the present invention, the system further comprises one or more
processors operable to generate a second set of checksum values by applying said second
checksum algorithm to said first candidate data blocks; one or more processors operable to
compare said second set of checksum values to said second checksum value; one or more
processors operable to identify a second set of candidate data blocks as corresponding to
values in said second set of checksum values equal to said second checksum value; and one
or more processors operable to identify all data blocks in said second set of candidate data
blocks as potentially being said first data block.
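
The two-stage candidate identification described in the recovery aspects above can be sketched as follows. This is a toy illustration under stated assumptions, not the claimed implementation: the two deliberately weak 8-bit checksums (weak_a, weak_b), the 2-byte block size, and the helper name candidates are all hypothetical, chosen so that collisions are frequent and an exhaustive linear scan finishes quickly.

```python
import hashlib
import zlib
from itertools import product


def weak_a(block: bytes) -> int:
    """Toy first checksum: low 8 bits of CRC-32 (deliberately collision-prone)."""
    return zlib.crc32(block) & 0xFF


def weak_b(block: bytes) -> int:
    """Toy second checksum: first byte of the MD5 digest."""
    return hashlib.md5(block).digest()[0]


def candidates(blocks, checksum, target):
    """Keep the blocks whose checksum equals the stored target value."""
    return [b for b in blocks if checksum(b) == target]


# The "data entry": two checksum values recorded for the original block.
original = b"\x12\x34"
entry = {"a": weak_a(original), "b": weak_b(original)}

# Linear scan over every possible 2-byte block (the first set of data blocks).
search_space = [bytes(p) for p in product(range(256), repeat=2)]

# Steps (a)-(c): first candidates match the first checksum value.
first = candidates(search_space, weak_a, entry["a"])

# The second checksum is applied only to the first candidates.
second = candidates(first, weak_b, entry["b"])
```

Every block left in second is still only "potentially" the original block; with checksums this short several candidates usually survive, which is why the specification also provides for a collision number.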

In various embodiments, the one or more processors operable to generate, the one or
more processors operable to compare, the one or more processors operable to identify a
second set of candidate blocks, and the one or more processors operable to identify all data
blocks may or may not be distinct. For example, one processor can be operable to generate,
compare, identify a second set of candidate blocks, and identify all data blocks.
Alternatively, a plurality of processors can be operable to generate, compare, identify a
second set of candidate blocks, and/or identify all data blocks.

In various aspects of the present invention, the first checksum algorithm is applied to
selected data blocks in the first set of data blocks through one of at least a linear scan or
nonlinear scan; the nonlinear scan comprises one of a skipping scan, a modulus scan, or an
exponential scan; each candidate data block is assigned a unique collision number; and at
least one of the one or more processors comprises an integer calculation unit and one or more
hash registers.

In an aspect, the present invention is directed to a system for data storage comprising 
computer implemented means for generating a first checksum value for a first data block and 
a second checksum value for the first data block; computer implemented means for creating a 
data entry comprising the first and second checksum values; and computer implemented
means for storing said data entry in a computer-readable medium. 

In various embodiments, the means for generating, means for creating, and means for
storing may or may not be distinct. For example, one means can generate, create, and store.
Alternatively, a plurality of means generate, create, and/or store.

In an aspect, the present invention is directed to a system for data storage comprising 
computer implemented means for identifying one or more attributes of a data block; 
computer implemented means for generating a first checksum value for the data block and a
second checksum value for the data block, wherein said first checksum value is generated by
applying a first checksum algorithm to said data block and said second checksum value is
generated by applying a second checksum algorithm, different from said first checksum
algorithm, to said data block; computer implemented means for creating a data entry
comprising data identifying: the first and second checksum values, the first and second
checksum algorithms, and at least one of the identified attributes of the data block; and 
computer implemented means for storing said data entry in a computer-readable medium. 

In various embodiments, the means for identifying, means for generating, means for 
creating, and means for storing may or may not be distinct. For example, one means can
identify, generate, create, and store. Alternatively, a plurality of means identify, generate,
create, and/or store. 

In an aspect, the present invention is directed to a system for data recovery
comprising computer implemented means for identifying one or more attributes of a first data
block and a second data block, said second data block comprising and different from said first
data block; computer implemented means for generating a first checksum value for the first
data block, wherein said first checksum value is generated by applying a first checksum
algorithm to said first data block; computer implemented means for generating a second
checksum value for the second data block, wherein said second checksum value is generated
by applying a second checksum algorithm to said second data block; computer implemented
means for creating a data entry comprising data identifying: the first and second checksum
values, and at least one of the identified attributes of the first and second data blocks; and
computer implemented means for storing said data entry in a computer-readable medium.

In various embodiments, the means for identifying, means for generating a first
checksum value, means for generating a second checksum value, means for creating, and
means for storing may or may not be distinct. For example, one means can identify, generate
a first checksum value, generate a second checksum value, create, and store. Alternatively, a
plurality of means identify, generate a first checksum value, generate a second checksum
value, create, and store.



FIGURES 

Fig. 1 is a flow chart illustrating the compression steps of the present systems and
methods;

Fig. 2 is a flow chart illustrating the validation steps of the present systems and
methods;

Fig. 3 is a flow chart further illustrating the operation of the flow in Fig. 2; and

Fig. 4 is a block diagram illustrating a parallel message digest processor.

DETAILED DESCRIPTION 

A checksum is a calculated representation of another number. There are many types
of checksums, including but not limited to, message digests, hashes, digital signatures, cyclic
redundancy checksum ("CRC"), and numerical checks. A checksum may comprise a group
of hashes, signatures, digests, or other tests that can fingerprint uniquely or relatively uniquely
a number. In this way a checksum may be more powerful than a hash, signature, etc., alone.

In one embodiment, a checksum test (validation) can be performed on a selected
number to verify whether the number has a checksum identical to an original number's checksum. As
will be recognized, a number may be a subset of a larger number, whereby the checksum is
validated against a subset of the original number. Further, subsets may be combined to form
additional checksums, thereby adding an additional level of validation for the underlying
number. For example, a large number may be divided into 16 equal parts, each having a
checksum. Additionally, each adjacent pair of numbers (8 pairs) can have their own
checksum, and each half of the master set can also have a checksum.
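
The layered example above (16 parts, 8 adjacent pairs, 2 halves) can be sketched as below. The function names digest and layered_checksums, the choice of SHA-1 for every level, and the 256-byte sample input are assumptions made for illustration; the specification does not prescribe them.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-1 hex digest, used here as the checksum at every level."""
    return hashlib.sha1(data).hexdigest()


def layered_checksums(data: bytes, parts: int = 16) -> dict:
    """Checksums for `parts` equal pieces, each adjacent pair, and each half."""
    if len(data) % parts or parts % 2:
        raise ValueError("data must split into an even number of equal parts")
    size = len(data) // parts
    pieces = [data[i * size:(i + 1) * size] for i in range(parts)]
    # Non-overlapping adjacent pairs: (0,1), (2,3), ..., giving parts/2 pairs.
    pairs = [pieces[i] + pieces[i + 1] for i in range(0, parts, 2)]
    half = len(data) // 2
    halves = [data[:half], data[half:]]
    return {
        "parts": [digest(p) for p in pieces],
        "pairs": [digest(p) for p in pairs],
        "halves": [digest(h) for h in halves],
    }


sums = layered_checksums(bytes(range(256)))
```

A candidate for any single part must then also agree with the checksum of its pair and of its half, which is the additional level of validation the paragraph describes.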

A checksum may be created to represent any data item or combination of data items.
For example, checksums may be created representing the contents of a file or the name of a
file. Further, a checksum may be created for the file name, its contents, and a digital
signature for the file. As will be recognized, a checksum may be based on any numerical
representation of data, including but not limited to file name, file size, file date, file length, a
file hash or digest, etc. For ease of discussion, the terms data, data item, and data block may
encompass any combination of data, including by way of example only, single data items,
combined data items, single or combined files, file names, descriptors, metadata, blocks of
data, etc.

A digital file can be represented as a series of numbers. In a preferred embodiment,
by utilizing a checksum, data in the form of a digital file can be compressed. In another
embodiment, a message digest, hash, or digital signature may be utilized in conjunction with 
the checksum. As will be recognized, the present systems and methods can be applied to any 
file types, including binary files. In addition, a binary format could be used to describe, as 
well as encode, a compressed file. As will be further recognized, increasing computing 
power can provide better compression and more thorough testing. 

COMPRESSION 

One embodiment of data compression is shown in connection with Fig. 1. In this 
embodiment, data is compressed by running a compression program on the data. Details of 
the compression are illustrated in connection with Fig. 1. As shown in Fig. 1, in step 102, a
data file that will contain the compressed file data is established. This step may also include
determining file attributes such as file size, length, name, block sizes, number of blocks, 
compression methods to be used, etc. 

In step 104, data is compressed using the desired compression methods. For example, 
data may be compressed by generating its SHA digest, MD5 digest, CRC, or any other digest, 
digital signature, checksum, or hashing method. By way of non-limiting example only, other 
digital signatures and checksums can include MD2, MD4, MD5, Ripe, SHA family, CRC16,
CRC32, and CRC64. Data can also be compressed by a Modulus, Modulus Remainder and 
Modulus Exponent, described below. In one embodiment, compression can be reversible. In 
another embodiment, compression can be nonreversible. Providing a combination of 
signatures and checksums and moduluses allows for the signature to be reversible. 

In one embodiment, data may be divided into blocks of any size before compression. 
As will be recognized, when using fixed length hashing, digests, or checksums, the larger the 
blocks the better the compression ratio. As will also be recognized, collisions may occur, and 
accordingly a collision counter may be implemented. Such counters are described in detail 
below. The result of a compression method is referred to herein as the checksum value. 
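
As a minimal sketch of step 104 and the block division just described, the following computes several checksum values per block with Python's standard library. The function names checksum_values, split_blocks, and entry_ratio are hypothetical, and the ratio counts only the stored digest bytes; it says nothing about the cost of recovering the block.

```python
import hashlib
import zlib


def checksum_values(block: bytes) -> dict:
    """Compute several fixed-length checksum values for one block."""
    return {
        "sha": hashlib.sha1(block).hexdigest(),
        "md5": hashlib.md5(block).hexdigest(),
        "crc": zlib.crc32(block),
    }


def split_blocks(data: bytes, block_size: int) -> list:
    """Divide data into consecutive blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def entry_ratio(block_size: int, digest_bytes: int = 20) -> float:
    """Size of one SHA-1 digest (20 bytes) relative to the block it stands for."""
    return digest_bytes / block_size


data = bytes(1000)  # placeholder input: 1000 zero bytes
blocks = split_blocks(data, 250)
values = [checksum_values(b) for b in blocks]
```

Because the digest length is fixed, entry_ratio shrinks as the block grows, which is the sense in which larger blocks give a better compression ratio; larger blocks also enlarge the space of colliding candidates.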

In step 106, data is written to the data file, or may be stored in temporary or
permanent computer memory (i.e., volatile or non-volatile memory). Such data includes the
compressed data in step 104. In one embodiment, the data file also contains all of the related
details of the data. Such details may include, by way of example only, the file name, the
original file size, block categories, block size, block identifiers, block entries, padding used
for any blocks, checksum numbers, etc. Each block category may be identified by a unique 
block identifier. Further, each block category may contain one or more block entries, 
padding entries, block size entries, block length entries, etc. Each block entry typically 
contains one hash, digest, checksum, reverse hash, or other entry, as created by the checksum 
generator. As discussed in detail below, such entries are used when validating the data, i.e., 
the block entry checksum value is compared to another number's checksum value. In one 
embodiment multiple block entries for the same data (using different compression methods) 
may be used for mutual exclusion. 

As will be recognized, and as described in detail below, the data file may be written in 
XML, SGML, or another markup language. For example, where SHA is used to compress a 
data block, an XML entry may contain the checksum value delimited by SHA identifiers, 
e.g., "<SHA> checksum value </SHA>". An example of such a markup file is below.

If no other data is to be compressed, the data file is saved, and compression is
complete (step 108); otherwise, flow continues to step 104.
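
Steps 102 through 108 can be sketched as a loop that writes one markup entry per block. The element names follow the exemplary listings in this description; the function names block_entry and compress_to_markup, the sample file name, and the use of Python's xml.etree.ElementTree are assumptions for illustration only.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET


def block_entry(number: int, block: bytes, block_size: int) -> ET.Element:
    """Build one block category holding several checksum block entries."""
    elem = ET.Element("block", {"no": str(number)})
    ET.SubElement(elem, "blocksize").text = str(block_size)
    ET.SubElement(elem, "blocklength").text = str(len(block))
    ET.SubElement(elem, "md5").text = hashlib.md5(block).hexdigest()
    ET.SubElement(elem, "sha").text = hashlib.sha1(block).hexdigest()
    ET.SubElement(elem, "crc").text = str(zlib.crc32(block))
    return elem


def compress_to_markup(name: str, data: bytes, block_size: int) -> str:
    """Step 102: establish the data file; steps 104/106: add each block entry."""
    root = ET.Element("file")
    ET.SubElement(root, "filename").text = name
    for n, start in enumerate(range(0, len(data), block_size), start=1):
        root.append(block_entry(n, data[start:start + block_size], block_size))
    # Step 108: no more data, so the entry is serialized.
    return ET.tostring(root, encoding="unicode")


markup = compress_to_markup("example.bin", b"x" * 4500, 2000)
```

Unlike the exemplary listings, this sketch emits well-formed XML (closing tags carry no attributes), which is what a DTD-based validator would expect.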

One embodiment of the present system and method comprises markup based 
checksum compression with attributes as an alternative to binary compression formats. The 
markup tags can be of variable length and can be shortened, i.e., the tags can be numbered to 
decrease length (e.g., "<1></1> <2></2>"). As will be recognized, using markup tags is
clearer than a binary compression format. The file can also be converted openly.

As will be recognized by those skilled in the art, to be converted openly means there
is an open format for XML or a non-XML text markup grammar definition for file signatures.
Further, there may be a computer program and utility (such as the Perl program included in
appendix A) for generating those file signatures according to a predefined grammar and
verifying them according to Document Type Definitions (DTDs) if the signature file is XML,
SGML, Backus-Naur, or any other grammar. Additionally, a file utility may have optional
arguments that can exclude or modify the file signature output given command options.
Also, it may mean that the XML format for any text markup allows for groups of checksums
to be applied together and for checksum operations to be specifically defined in the DTD or
grammar definition. A text format is openly described, viewable, agreed upon, extendable,
and verifiable. In addition, checksum operations can be defined and agreed upon in an open
consistent format. A binary format is less open by being in a non-text format that is not
always readable by ordinary text editors. A binary file signature format that is openly defined
can also be used for signatures but is much less open, resulting in data not being locked in a
proprietary compression binary. Further, this provides a means to utilize the principle of
mutual exclusion, whereby SHA and MD5 (or any number of hashing algorithms) can be
used to check each other.

As discussed above, data can be compressed using any mechanism, e.g., a hashing 
mechanism. In one embodiment, data may be divided into any number of blocks and each 
block can be hashed. For example, data of size 1000 bytes can be divided into four 250 byte
blocks, whereby each block can be hashed. As will be recognized, data can be divided into a
series of blocks n bytes long with padding or without padding. For example, before 
compression, data of size 992 bytes can be divided into four 250 byte blocks, each with 2 
padding bytes. 
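
The padding arithmetic in the 992-byte example can be sketched as follows. The function name split_with_padding, the zero pad byte, and the rule that the shortfall is shared evenly among the blocks are assumptions inferred from the example, not requirements stated in the text.

```python
import math


def split_with_padding(data: bytes, block_size: int, pad: bytes = b"\x00"):
    """Split data into equal blocks of `block_size` bytes, padding each one.

    Returns (blocks, pad_per_block). Assumes the shortfall divides evenly
    among the blocks, as in the 992-byte example.
    """
    if not data:
        return [], 0
    count = math.ceil(len(data) / block_size)
    shortfall = count * block_size - len(data)
    if shortfall % count:
        raise ValueError("padding does not divide evenly among blocks")
    pad_each = shortfall // count
    payload = block_size - pad_each  # data bytes carried by each block
    blocks = [data[i * payload:(i + 1) * payload] + pad * pad_each
              for i in range(count)]
    return blocks, pad_each


padded_blocks, pad_each = split_with_padding(bytes(range(248)) * 4, 250)
exact_blocks, no_pad = split_with_padding(bytes(1000), 250)
```

For 992 bytes and a 250-byte block size this yields four blocks of 248 data bytes plus 2 padding bytes each; the padding count must be recorded in the data entry so it can be stripped on reconstruction.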

In one embodiment, each block can be hashed using more than one hashing 
mechanism. Additionally, two or more blocks may be hashed independently to create two 
respective checksum values, as well as together to create a composite (third) checksum value. 
In one embodiment, these multiple checksum values may be used for mutual exclusion, 
described in detail below. Briefly, however, mutual exclusion provides a mechanism to 
increase the probability that a validated number is the correct number. For example, where a 
first block is validated (e.g., by locating a sequence of numbers having the same MD5 or 
SHA hash), an additional digest, hash value, or checksum may be validated to ensure that the 
sequence of numbers is the correct sequence. With mutual exclusion there is a decreased 
chance of more than one collision, i.e., it is statistically improbable that multiple hashes for
the same sequence of numbers will all result in collisions. As will be recognized, any number
of digests, hash values or checksums can be used to further increase reliability. In one
embodiment, a number can also be processed through a checksum in reverse or be run
through a variable length checksum which generates a variable length key from 30 to 50
bytes. As will be recognized, SHA and MD5 are fixed byte length hash algorithms, with
SHA having a longer signature key than MD5.
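
A minimal sketch of hashing two blocks independently and together, and of using every stored value as a mutual check, might look like this. The names md5_hex, sha_hex, describe, and validate are hypothetical, and SHA-1 stands in for "SHA".

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def sha_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def describe(first: bytes, second: bytes) -> dict:
    """Checksum values for each block alone plus a composite (third) value."""
    return {
        "first": {"md5": md5_hex(first), "sha": sha_hex(first)},
        "second": {"md5": md5_hex(second), "sha": sha_hex(second)},
        "composite": {"sha": sha_hex(first + second)},
    }


def validate(first: bytes, second: bytes, stored: dict) -> bool:
    """Accept a candidate pair only if every stored checksum value matches."""
    return describe(first, second) == stored


stored = describe(b"block one", b"block two")
```

A candidate that collides on one algorithm must also collide on the others and on the composite value to be accepted, which is the intent of mutual exclusion. The fixed lengths are visible in the output: the MD5 value is 32 hex characters and the SHA-1 value is 40.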

An exemplary compression markup file is below: 
<file>
<filename>Horacleon.doc</filename>
<block no="1">
  <blocksize>2000</blocksize>
  <blocklength>1233</blocklength>
  </padded>
  <md5>42dcd3c7 6f0*69*8 614b999b d1f2acdc 8de47d88</md5>
  <sha>a9993e36 4706816a ba3e2571 7850c26c 9cd0d894</sha>
  <crc>129837129</crc>
</block no="1">
<block no="2">
  <blocksize>2000</blocksize>
  <blocklength>133</blocklength>
  </padded>
  <md5>a9993a364706816aba3e25717850c26c9cd0d89d</md5>
  <sha>2d795c01 ff54cfd1 ba6771c5 99c1ac64 baf1acc7</sha>
  <crc>129837129</crc>
</block no="2">
<block no="3">
  <blocksize>2000</blocksize>
  <blocklength>133</blocklength>
  </padded>
  <md5>a9993e364706816aba3e25717850c26c9cd0d89a</md5>
  <sha>2d795c01 ff54cfd1 ba6771c5 99c1ac64 baf1acc7</sha>
  <crc>129837129</crc>
</block no="3">
</file>



As a further example, suppose a file named "file" is compressed as "file.archive". 
The size of "file" is determined to be 120,233 bytes. It may be desired that data be divided 
into blocks. Where a block size is defined as 20000 bytes, "file" could be divided into 7 
blocks, with the last block having only 233 bytes. Each block may be run through a 
checksum generator and a checksum block entry is written for each compression method used 
on the block. For example, where block number 2 is compressed using SHA and MD5, a 
block entry for each would be included in the block category for block number 2. Other 
entries may be listed in this block category, such as the referenced block number (here "2"), 
block size (here 20000 bytes), block length, padding, etc. All blocks can be saved to
"file.archive", which may have a file size of only 2,156 bytes. An example of the file
"file.archive" is below.
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<ttUun*>tll*</£il«aaoft> - 
<£)Xock no- *l*> 

<Uock*ix*>20000</hlock*ix*> 
<hlockl«cgth>3 0 O00</blockl«cgth> / 

<aharxtv*r9«>42did03e7 6f2e*9»a «14b999b <&U3333 8d»i7dB8</ B hi^T^mr^ 

. <croU983712W0000</crc> • ^ 

< /block ao*'l-> 
<block no-»2*> - 

<hlr>rVilf >200QO</bloc3aixtf> 

<Mo c M « ngtb>2000<K/hlocja«gq» 

</podd*dv 

<ito-r«vw*«>«9b00«36470681tabtf^ 

<*ha> 2d700c01 f £54cMl baS77XcS 99clac64 ba£Ucc7</«ha> 
<croUS8371292l3123321</cto 
< /block no*'2*> 

<block na»»3*>. '** 
<blocksU«>20000</])lbckSlxt> • 
^bi^««t^20(nMWblocdaen^th> 

<ghar»w>a92222<36470683 Uba3»257l7gS0c26cScd0dB9a</aha-i 
<sba> 2d799cOl ttStctOX baS771c5 99clac64 ba£lacc7VahA> 
<crol29837l29</cro - 
</block no»*3*> 
<block no«*4»> . 
<MockalM>20Q00</bl ochtf f> . 
<blpckl«9th>2000(></blocka«ngth> 

. ^»b»-rworw^9233^«470W6«bea%2571783ac26c9cdO^ 
<«h*> 2d*£fc01 t£ MUdl b*6771cS 99cUc*4 battacc7</«h*> 
<crol02839203342234</cro 

</block no-"4*> 

<blooX bo»*5*> 

.<Mnck»ixa>200Q0</bIockaiw 
<blocldenacto>20000</blocklcngth> 

<afcw«vvr««>^*a3^f470f&16aba3ft76555850c2fe9c^^ 
• <aha> 2d793c01 ffffff ba6771c5 99oUc64 bftflACC7</*ha> . 
<croU9837X291233</cro 
</block no*'5*> * 
.- : <block no-*6'> , 
<h1ockwizr»20000</hlock»t»«> / 
< M o ckl <p g th>20000</ b loc kl <n g th> 

<slk^rev«rM>«m«e36470fiBl6aba3tf57l7e50e26c9c^d8^yite-x«v«r^ 
<aha> 2d00PcOl I f— edl b&6771cS 99c1ac64 bftflacc7</afca> 
.<cxol2982344129</cxo> 
</block no«»6*> • > 

-ftj^fr HO-*7*> 

^>lock»lxa>20000</bloctol»d> 
<blockl«9th>293</blockl«8tb> 

<8ha-r«^r8e>22222«3^7068l6aba3«257178SOc26c9cdOda9d</9ha> 
<aha>. OOOOOcOl ££ 54c£dl b*£771c5 99cUc64 baflacc7</aha> 
<croU9833U9</cro 
« /block no-*7*> 
' </£ile> 
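By way of non-limiting illustration only, the division of a file into blocks and the creation of one entry per block might be sketched as follows. The function names `block_entries` and `to_markup` are hypothetical, SHA-1 and CRC-32 stand in for the checksum methods, "SHA-reverse" is assumed to mean the SHA of the block's bytes in reverse order, and the emitted markup is a simplified, well-formed variant of the layout shown above.

```python
import hashlib
import zlib


def block_entries(data: bytes, block_size: int = 20000) -> list[dict]:
    """Split data into fixed-size blocks and describe each block by its checksums."""
    entries = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        entries.append({
            "no": offset // block_size + 1,      # 1-based block number
            "blocksize": block_size,             # nominal block size
            "blocklength": len(block),           # bytes actually in this block
            "sha": hashlib.sha1(block).hexdigest(),
            # Assumption: "SHA-reverse" is SHA-1 over the reversed bytes.
            "sha-reverse": hashlib.sha1(block[::-1]).hexdigest(),
            "crc": zlib.crc32(block),
        })
    return entries


def to_markup(filename: str, entries: list[dict]) -> str:
    """Render the entries as simple markup (hypothetical layout)."""
    lines = ["<file>", f"<filename>{filename}</filename>"]
    for e in entries:
        lines.append(f'<block no="{e["no"]}">')
        for tag in ("blocksize", "blocklength", "sha-reverse", "sha", "crc"):
            lines.append(f"<{tag}>{e[tag]}</{tag}>")
        lines.append("</block>")
    lines.append("</file>")
    return "\n".join(lines)
```

For a 120,233 byte input and a block size of 20000, this sketch yields 7 entries, the last with a block length of 233, matching the example in the text.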



RECONSTRUCTION 

In a preferred embodiment, archived data is reconstructed by running a reconstruction
program on the archived data. Details of the validation are illustrated in connection with Fig. 2.

In step 202, a compressed file is opened. This may be an archive file or, as described
above, an XML, SGML, or other file with compressed file descriptors. File descriptors can
include the file name, the original file size, block categories, block identifiers, one or more
block entries, padding entries, block size entries, block length entries, checksum numbers, etc.
Each block entry may contain one or more of a hash, digest, checksum, reverse hash, or other
entry, as created by the checksum generator. Further, the file descriptor may also contain file
metadata or data attributes. By way of non-limiting example only, data attributes can be at least
one of a size, length, filename, directory, root, drive, path, date, time, modified date, permission,
owner, byte order, and type, or other properties and metadata. Each of the file's attributes can be
included in the archive signature for that file.

In step 204, a scan is performed on a range of numbers. The scanning process is a 
method of finding a number that generates a checksum value matching a known checksum, in 
particular, examining each number in a range of numbers to find a block of data resulting in 
the same checksum value as a particular block entry. In various embodiments, different
scanning methods may be used including, by way of example only, linear scans, non-linear 
scans, skipping scans, exponential scans, restrictive scans, modulus scans, etc. As will be 
recognized, different scanning methods may improve scan times by reducing the set of 
numbers in the scan range. As will be further recognized, non-scanning methods may also 
be used in connection with scanning to simplify the scanning process. For example, 
determining if a number is prime, composite, even, or odd may be used as a step in the 
validation process. Various scanning methods are discussed in detail below. 

In one embodiment, each number in the scan range is verified against a known
checksum. In another embodiment, each number in the scan range is verified against a 
known checksum, and where the verification is a success, the number is verified against one 
or more subsequent checksums (as described in connection with mutual exclusion). In 
another embodiment, each number in the scan range is verified against all known checksums. 

In various embodiments, nonlinear or skipping scans may be used during scanning.
Nonlinear or skipping scans are scans that can be out of sequential order or may skip
numbers, e.g., skipping or nonlinear scans could skip every odd number, every even number,
every 5th number, or scan randomly. Additional embodiments include other scanning
methods, including a permutation scan (i.e., a checksum scan that can run against a series of
permutations), a parallel scan (i.e., a checksum scan that can divide the scanned range
between multiple computers, microprocessors, processes, etc.), and a restriction scan (i.e., a scan
that can be run against a sequence of numbers, e.g., from 0 to N, or from M to N). With a
restriction scan, a floor or ceiling may be defined such that the scanning is only performed on
numbers between the floor and ceiling. These and other scanning embodiments are discussed
in detail below.

For example, an input O(Input) may have a checksum output N. In order for a
number M(hashed) to be equal to O(Input), its checksum output N2 must equal N. So N
= N2 whenever O = M. This aspect of scanning may be expressed by the following formula:

O(Input) = Checksum Output (N)
M(hashed) = Checksum Output (N2)

In one embodiment, a linear scan is used. This is the most basic scan, where a range
of numbers, e.g., from 1 to n, is scanned incrementally. Reverse scans, from n to 1, may also
be used. Other scanning methods are described below.

In a preferred embodiment, a checksum can be used to find a number that is
associated with the digital signature, message digest, checksum, hash, etc. of a known
number. Any scanning method may be implemented to validate the checksum. For example,
in a linear scan, a number is incremented and tested to see if it matches the known checksum.
If it does not match, the number is again incremented and tested. These steps are repeated
until the checksum is either found or a maximum number has been reached. As will be
recognized, other scanning methods, as mentioned both above and below, may be utilized to
validate checksums.
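The linear scan described above may be sketched as follows, by way of illustration only. The helper `number_to_bytes` (a big-endian encoding of the number) is an assumed convention not specified in the text, and SHA-1 stands in for whichever checksum method is in use.

```python
import hashlib


def number_to_bytes(n: int) -> bytes:
    """Encode a non-negative integer as big-endian bytes (an assumed convention)."""
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")


def linear_scan(target_digest: str, maximum: int):
    """Return the first n in 0..maximum whose SHA-1 equals target_digest, else None."""
    for n in range(maximum + 1):
        # Increment, hash, and compare against the known checksum.
        if hashlib.sha1(number_to_bytes(n)).hexdigest() == target_digest:
            return n
    return None  # maximum reached without a match
```

The scan stops either at the first matching number or when the maximum has been reached, mirroring the two termination conditions in the paragraph above.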

In one embodiment, a parallel scan is used. An illustrative example of a parallel scan
follows. A scan range exists from 0 to 10029000. The scan range can be divided between 2 or
more processors, and each processor is assigned a corresponding fraction of that range. Here,
4 processors are each allocated 1/4 of the range. 10029000 is divided into 4 parts and each
CPU runs the checks to determine if one number outputs the same checksums. If the number
generates an identical checksum, computer processing ends. Note that additional processing
may take place where mutual exclusion is incorporated. Mutual exclusion is described in
greater detail below. For example, if number 12333 generates the correct MD5 hash, then the
number has potentially been found. Mutual exclusion verification provides assurance that the
number is correct, e.g., where the number also generates the correct CRC and SHA, there is a
greater likelihood that the number is correct. A pair of digital signatures can also be created
which may have collisions but are designed to produce different signature results when used
together. This is an example of mutual exclusion. For example, digital signature xyz can be
designed to produce a different signature than digital signature zzz.

The following illustrates a parallel scan.

CPU 1        CPU 2              CPU 3              CPU 4
0-2507250    2507251-5014499    5014500-7521750    7521751-10029000

Located Number 12333:
CPU 1 Checksum Result:
MD5: 42dcd3c7 6f0e69e8 614b999b d1f2acdc 8de47d88
SHA: 42dcd3c7 6f0e69e8 614b999b d1f2acdc 8de47d88 1231231
CRC: 123213231213
Result: Number 12333 matches the original checksums.

Original Checksum:
MD5: 42dcd3c7 6f0e69e8 614b999b d1f2acdc 8de47d88
SHA: 42dcd3c7 6f0e69e8 614b999b d1f2acdc 8de47d88 1231231
CRC: 123213231213
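The division of a scan range among processors may be sketched as below, as one possible arrangement only. The names `split_range`, `scan_subrange`, and `parallel_scan` are hypothetical, threads stand in for separate CPUs, the big-endian number encoding is an assumption, and this simple version does not halt the other workers once a match is found.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def split_range(start: int, stop: int, workers: int) -> list[tuple[int, int]]:
    """Return contiguous, non-overlapping inclusive (lo, hi) sub-ranges."""
    base, extra = divmod(stop - start + 1, workers)
    parts, lo = [], start
    for i in range(workers):
        hi = lo + base + (1 if i < extra else 0) - 1
        parts.append((lo, hi))
        lo = hi + 1
    return parts


def digest(n: int) -> str:
    """SHA-1 of n encoded as big-endian bytes (an assumed convention)."""
    return hashlib.sha1(n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")).hexdigest()


def scan_subrange(target: str, lo: int, hi: int):
    """Linear scan of one worker's share of the range."""
    for n in range(lo, hi + 1):
        if digest(n) == target:
            return n
    return None


def parallel_scan(target: str, start: int, stop: int, workers: int = 4):
    """Scan each sub-range in its own worker and return the first match found."""
    parts = split_range(start, stop, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: scan_subrange(target, *part), parts)
        return next((n for n in results if n is not None), None)
```

Because the range 0 to 10029000 contains 10029001 numbers, an even split gives the first worker one more number than the others, so the boundaries differ slightly from those in the illustration.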

In one embodiment, there may be collisions outside a particular range of numbers,
and accordingly, a scan could iterate through a defined range to find a number X that is
associated with a checksum Y. This is often called a restriction scan. A restriction scan can
define a starting point as well as an ending point for a scan, e.g., from 12 to 123. Another
example of a restriction scan can be a collar, such that a number N to the Xth power acts as a
starting point with a number M to the Yth power as an ending point.

In another embodiment, a skipping scan can be used to reduce the number of 
iterations in a scan. Such a skipping scan can skip numbers during the scanning process. For 
example, the scan could skip odd or even numbers, every nth number, or a defined skipping
set, and perform checksums on the relevant numbers. 
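A restriction scan and a skipping scan can both be viewed as ways of generating a reduced set of candidate numbers before any checksum is computed. The following is a minimal sketch under that view; the generator name `candidates` and its parameters are illustrative assumptions.

```python
def candidates(floor: int, ceiling: int, step: int = 1, skip=None):
    """Yield numbers in [floor, ceiling], every `step`-th one, omitting those skip(n) rejects."""
    for n in range(floor, ceiling + 1, step):
        if skip is not None and skip(n):
            continue  # member of the skipping set: no checksum is computed
        yield n


# Restriction scan from 12 to 123 that also skips even numbers.
odd_only = list(candidates(12, 123, skip=lambda n: n % 2 == 0))
```

Each yielded number would then be passed to the checksum test, so the cost of the scan falls in proportion to the numbers excluded.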

In another embodiment, a modulus scan can be used. A modulus scan is described in 
detail below. 

In another embodiment, and as mentioned above, mutual exclusion may be used to
increase the speed of the validation, as well as to ensure the uniqueness of the data. As will
be recognized, message digests, hashes, and digital signatures are similar to checksums in that
they provide an alternate method to represent a known data item. In the present system and
method, multiple checksums, signatures, hashes, digests, etc., can be used to validate a
known checksum based on the principle of mutual exclusion. In one embodiment, mutual
exclusion is the result of pairing two or more signatures or checksums so that the signatures
and signature results validate other signatures and signature results against collisions. For
one digital signature, hash, or checksum there can be a collision, but two checksums can
produce different results, thereby increasing the probability that the correct number has
been found. Digital signatures can also be created that produce different results for different
inputs and minimize collisions.

By providing more than one mathematical representation of a known data item, file,
block, etc., mutual exclusion ensures that there is a higher statistical probability of uniqueness
when validating the data through scanning. Where a collision occurs for one message digest,
it is unlikely to also be a collision for a second digest. Further, two digests can be used to
generate a different and distinct key for a given input. As will be recognized, based on the
principle of mutual exclusion, 2 or more checksums using different checksum methods can be
used to verify that a checksum number validates against the original number. With 3, 4, or
more checksums, each using a different checksum method, it is even more unlikely that
collisions will occur in all of the digests.

For example, the following inputs generate the same MD5 hash:

This is the hex input for file md5coll1:

This is the hex input for file md5coll2:

dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0
e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e
c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70

They both generate this md5 digest: 79054025255fb1a26e4bc422aef54eb4.

Accordingly, by providing more than one mathematical representation for the data 
inputs, mutual exclusion ensures that there is a higher statistical probability of uniqueness 
when validating the data through scanning. Below is an example of digests included in an 
XML markup file, whereby different digests have been created for the above inputs, 
illustrating the concept of mutual exclusion: 

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
<FILE>
<NAME>md5coll1</NAME>
<SIZE>128</SIZE>
<md t="SHA">a19e89df4ccb344af4f0372907d8ad7d40296ea5</md>
<mdr t="SHA">5aebda61fc4bc8ff3621f6304076b491a97a57ec</mdr>
<md t="RIPE160">9c20bd94bc95d93dddd607ebdf0e2944061ab816</md>
<mdr t="RIPE160">d43b384a046b91536ab6cc1847ff4f906ba0e535</mdr>
<md t="MD5">79054025255fb1a26e4bc422aef54eb4</md>
<md t="MD5_REVERSE">63692f882033b4e2c13d437f35e33271</md>
<md t="MD4">4dca7748578ceefb18de6ea42af36aed</md>
<md t="MD2">85cf988625d154279d11de59bf377cc3</md>
<MODULUS>+9611342</MODULUS>
<FLOOR></FLOOR>
<CEIL></CEIL>
</FILE>
<FILE>
<NAME>md5coll2</NAME>
<SIZE>128</SIZE>
<md t="SHA">9ed5c62e6678248ab42c69961720b910e3618288</md>
<mdr t="SHA">fbe543c5b550374b3f4818dc24e80af8615d191c</mdr>
<md t="RIPE160">70f686e0ae36f7e0d59da69b473749e92c087740</md>
<mdr t="RIPE160">e3b2a7b5f2630314a8b77e2aa429cd308e0c7871</mdr>
<md t="MD5">79054025255fb1a26e4bc422aef54eb4</md>
<md t="MD5_REVERSE">bf5bf87f65e79b98af7985885f3e5ee0</md>
<md t="MD4">7a9919f9efb2ecae17012dcf94edc983</md>
<md t="MD2">358aba7632d39f6c41f400eedb7b31de</md>
<MODULUS>+4981022</MODULUS>
<FLOOR></FLOOR>
<CEIL></CEIL>
</FILE>
</ARCH>

In one embodiment, mutual exclusion verifies the least processor intensive hash,
digest, or checksum associated with a particular data item first. In this way, the processor is
not required to calculate intensive hashes, digests, or checksums for every number in the scan
range. Accordingly, the number of processor intensive steps can be reduced. Additionally,
by incorporating a collision counter (discussed in detail below), subsequent collisions can
incorporate a unique collision number in order to uniquely identify the validation.
Subsequent mutual exclusion tests are only performed on the matching number. A cost
pyramid can also be obtained, with the lower cost digests being processed first and the higher
cost signature digests or checksums being processed and tested last (and only if a previous
checksum or test is successful). If there are three digests, the last two are checked only if the
preceding checksum or digital signature test succeeds.
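The cost pyramid described above may be illustrated by the following sketch, which applies the checks in an assumed order of increasing cost (CRC-32, then MD5, then SHA-1) and stops at the first failure. The names `CHECKS` and `verify`, and the particular ordering, are assumptions made for illustration.

```python
import hashlib
import zlib

# Checks ordered from (assumed) cheapest to most expensive.
CHECKS = [
    ("crc", lambda b: format(zlib.crc32(b), "08x")),
    ("md5", lambda b: hashlib.md5(b).hexdigest()),
    ("sha", lambda b: hashlib.sha1(b).hexdigest()),
]


def verify(candidate: bytes, expected: dict):
    """Return (matched, checks_run); later checks run only if earlier ones pass."""
    checks_run = []
    for name, fn in CHECKS:
        checks_run.append(name)
        if fn(candidate) != expected[name]:
            return False, checks_run  # fail fast: costlier digests are never computed
    return True, checks_run


block = b"example block"
expected = {name: fn(block) for name, fn in CHECKS}
```

In this arrangement nearly all non-matching candidates are rejected after the CRC alone, and the more expensive digests are computed only for the rare candidates that survive it.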

In step 206, where a checksum match has been found, the number resulting in the
checksum match is written to an output file. If the checksum found is the first match, the
checksum file may be created. In one embodiment, a block identifier determines the position
of the found number in the output file. For example, a number corresponding to the identifier
for block 1 would be placed in the first position of the output file, and a number
corresponding to the identifier for block 2 would be appended to the end of block 1, but
would appear before the number in block 3. In this way, multiple computers and processors,
including distributed networks, may be used to decompress files while at the same time
preserving the desired order in the output file.

If all data has been verified, validation is complete and the output file is saved (step
208). If additional data is to be verified, flow continues to step 204.
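One way the block identifier could determine the position of a recovered block in the output file is sketched below. The function `write_block` is hypothetical and assumes a fixed nominal block size, so that block k begins at offset (k - 1) * block size regardless of the order in which blocks are recovered.

```python
import io


def write_block(out, block_no: int, block_size: int, data: bytes) -> None:
    """Place a recovered block at the offset implied by its 1-based block number."""
    out.seek((block_no - 1) * block_size)
    out.write(data)


# Blocks recovered out of order, e.g. by different processors.
out = io.BytesIO()
write_block(out, 3, 4, b"CC")    # last block, shorter than the block size
write_block(out, 1, 4, b"AAAA")
write_block(out, 2, 4, b"BBBB")
```

After the three writes the buffer holds the blocks in their original order, which is the property the paragraph above relies on when several machines contribute blocks independently.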

Fig. 3 is an example of the flow described in Fig. 2. As illustrated in step 302, a
decompression program loads in a file named "file.archive." In step 304, it scans numbers
from a to b and runs them through SHA, SHA-reverse, and CRC to find a number that
matches the SHA signature. In step 306, the blocks are iterated on from block 1 to block 7 to
find a group of numbers that matches the block 1 to block 7 hash or checksum. SHA-reverse
is the block run through SHA in reverse. In step 308, after a number passes all of the
checksums, it is output as a block. A description of the iteration process follows.

Iteration #1, Block #1

If the checksum is False for block entry 1 with respect to the defined compression
methods, processing continues to

Iteration #102999990128932901, Block #1

If the checksum is True for block entry 1 with respect to the defined compression
methods, the block is output to file.unarchive.

Processing continues for all next blocks.

Iteration #1, Block #n, to Iteration #n, Block #n.

MODULUS SKIPPING

Modulus skipping is related to the Chinese Remainder Theorem. This theorem
answers questions such as: there is a number n which divided by 233 is 3, and n divided by 6
is 7. What is the number? Modulus skipping involves answering a question such as: there
is a binary number n which divided by a million (the modulus) leaves a remainder of 1233, and
whose SHA signature is X, and whose MD5 signature is Y, and whose length is 123 bytes. What is the
integer binary number of 123 bytes?

Modulus skipping is a type of skipping scan as described above. A file typically
constitutes a unique number in bits (i.e., a 2 byte file has a length of 16 bits and a numerical
value from 0 to 65,535). A file number can represent the number formed by the bits of the
file. A file number can be divided by another number to get a remainder. A smaller number
will go into a larger number n times with a remainder x (modulus and remainder). In a
preferred embodiment, a file number can have the modulus of its hash used as an iterator. In
this way a file number can be represented as a multiple of its modulus plus the remainder.
This allows a file to be searched for by iterating over (modulus * iteration number +
remainder) to find a signature match that equals the starting signature of an unknown file
number block.

For example, a string can be compressed according to the present system and 
methods. In this example, the term "hash" describes the result of the compression method. 
To find the original string one can scan over all combinations of strings to find one with the
same hash or iterate over the modulus and the remainder to skip-scan for a data block that 
matches a hash. 

For example, 1000 mod 15 = 10, or the number 1000, using the modulus 15, has a
remainder of 10. In other words, the number 1000 divided by 15 equals 66 (and 15 * 66 =
990) with 10 remaining. A modulus scan can use this to reduce the number of iterations
required to find a matching hash. Here, a modulus scan for the number 1000 would iterate 66
times versus 1000 for a non-modulus scan. The following table illustrates this point.

Iteration n * Modulus + Remainder = Result    Hash

 1 * 15 + 10 = 25                             abcccc
 2 * 15 + 10 = 40                             deeeee
 3 * 15 + 10 = 55                             xeeeee
 4 * 15 + 10 = 70                             eeerrrr
...
66 * 15 + 10 = 1000                           eeeeee



Accordingly, one may iterate over the modulus and hash the number until a number is
found that matches a particular hash, rather than iterating through all the number
combinations and taking the hash. In this way, one may iterate through a number's multiple
of its modulus plus its remainder to find a number that matches a hash. The saving may seem very
small until one takes a 1000 bit number block with a 128 bit modulus and the remainder
number, which reduces the number of candidates to be hashed by a factor of roughly the
modulus. This method can speed up a digital signature search by trillions of times and improve
the uniqueness.
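The modulus scan over (modulus * iteration number + remainder) may be sketched as follows. The function names are hypothetical, SHA-1 stands in for the signature, and the big-endian encoding of the number is an assumed convention.

```python
import hashlib


def digest_of(n: int) -> str:
    """SHA-1 of n encoded as big-endian bytes (an assumed convention)."""
    data = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
    return hashlib.sha1(data).hexdigest()


def modulus_scan(target: str, modulus: int, remainder: int, ceiling: int):
    """Hash only numbers congruent to `remainder` mod `modulus`; return (match, tried)."""
    tried = 0
    for n in range(remainder, ceiling + 1, modulus):
        tried += 1
        if digest_of(n) == target:
            return n, tried
    return None, tried


result = modulus_scan(digest_of(1000), modulus=15, remainder=10, ceiling=1000)
```

Because this sketch begins at the remainder itself (iteration number 0, candidate 10), it tries 67 candidates before reaching 1000, one more than the 66 iterations counted in the table, which begins at iteration number 1.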

Below is an example of an iterative, non-skipping scan. String "a" is hashed, and then
a string is iterated over the permutations, from 0 (null) to 97 in a C string, until a string
number matches the hash.

The digest of string a is 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8

Digest String: 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8: string a!



String Digest DA39A3EE5E6B4B0D3255BFEF95601890AFD80709 0 0 0
String Digest BF8B4530D8D246DD74AC53A13471BBA17941DFF7 1 0 1
String Digest C4EA21BB365BBEEAF5F2C654883E56D11E43C44E 2 0 2
String Digest 9842926AF7CA0A8CCA12604F945414F07B01E13D 3 0 3
String Digest A42C6CF1DE3ABFDEA9B95F34687CBBE92B9A7383 4 0 4
String Digest 8DC00598417D4EB788A77AC6CCEF3CB484905D8B 5 0 5
String Digest 2D0134ED3B9DE132C720FE697B532B4C232FF9FE 6 0 6
String Digest 5D1BE7E9DDA1EE8896BE5B7E34A85EE16452A7B4 7 0 7
String Digest 8D883F1577CA8C334B7C6D75CCB71209D71CED13 8 0 8
String Digest AC9231DA4082430AFE8F4D40127814C613648D8E 9 0 9
String Digest ADC83B19E793491B1C6EA0FD8B46CD9F32E592FC 10 0 10
String Digest 067D5096F219C64B53BB1C7D5E3754285B565A47 11 0 11
String Digest 1E32B3C360501A0EDE378BC45A24420DC2E53FBA 12 0 12
String Digest 11F4DE6B8B45CF8051B1D17FA4CDE9AD935CEA41 13 0 13
String Digest 320355CED694AA69924F6BB82E7B74F420303FD9 14 0 14
String Digest C7255DC48B42D44F6C0676D6009051B7E1AA885B 15 0 15
String Digest 6E14A407FAAE939957B80E641A836735BBDCAD5A 16 0 16
String Digest A8ABD012EB59B862BF9BC1EA443D2F35A1A2B222 17 0 17
String Digest C4F87A6290AEE1ACFC1F26083974CE94621FCA64 18 0 18
String Digest 5A8CA84C7D4D9B055F05C55B1F707F223979D387 19 0 19
String Digest 3CE0A1AF90B6E7A3DD8D45E410884B588EA2D04C 20 0 20
String Digest 7762EABF93B7FE8EC5D648CD3B1D9EB6D820CAA2 21 0 21
String Digest A9D3C9CD54B1A392B21EA14904D9A318F74636B7 22 0 22
String Digest 094D98B399BF4ACE7B8899AB7081E867FB03F869 23 0 23
String Digest C2143B1A0DB17957BEC1B41BB2E5F75AA135981E 24 0 24
String Digest E9C5D7DB93A1C17D45C5820DAF458224BFA7A725 25 0 25
String Digest EBDC2288A14298F5F7ADF08E069B39FC42CBD909 26 0 26
String Digest 27F57CB359A8F86ACF4AF811C47A6380B4BB4209 27 0 27
String Digest B830C46D24068069F0A43687826F355B21FDB941 28 0 28
String Digest 5983AD8F6BFEA1DEDA79409C844F51379C52BE2D 29 0 29
String Digest 7FD88C329B63B57572A0032CF14E3E9EC861CE5F 30 0 30
String Digest 953EFE8F531A5A87F6D2D5A65B78B05E55599ABC 31 0 31
String Digest B858CB282617FB0956D960215C8E84D1CCF909C6 32 0 32
String ! Digest 0AB8318ACAF6E678DD02E2B5C343ED41111B393D 33 0 33
String " Digest 2ACE62C1BEFA19E3EA37DD52BE9F6D508C5163B6 34 0 34
String # Digest D08F88DF745FA7950B104E4A707A31CFCE7B5841 35 0 35
String $ Digest 3CDF2936DA2FC556BFA533AB1EB59CE710AC80E5 36 0 36
String % Digest 4345CB1FA27885A8FBFE7C0C830A592CC76A552B 37 0 37
String & Digest 7C4D33785DAA5C2370201FFA236B427AA37C9996 38 0 38
String ' Digest BB589D0621E5472F470FA3425A234C74B1E202E8 39 0 39
String ( Digest 28ED3A797DA3C48C309A4EF792147F3C56CFEC40 40 0 40
String ) Digest E7064F0B80F61DBC65915311032D27BAA569AB2A 41 0 41
String * Digest DF58248C414F342C81E056B40BBE12D17A08BF61 42 0 42
String + Digest A979EF10CC6F6A36DF6B8A323307EB3BB2B2DB9C 43 0 43
String , Digest 5C10B5B2CD673A0616D529AA5234B12EE7153808 44 0 44
String - Digest 3BC15C8AAE3E4124DD409035F32EA2FD6835EFC9 45 0 45
String . Digest 3A52CE780950D4D969792A2559CD519D7EE8C727 46 0 46
String / Digest 42099B4AF021E53FD8FD4E056C2568D7C2E3FFA8 47 0 47
String 0 Digest B6589FC6AB0DC82CF12099D1C2D40AB994B8410C 48 0 48
String 1 Digest 356A192B7913B04C54574D18C28D46E6395428AB 49 0 49
String 2 Digest DA4B9237BACCCDF19C0760CAB7AEC4A8359010B0 50 0 50
String 3 Digest 77DE68DAECD823BABBB58EDB1C8E14D7106E83BB 51 0 51
String 4 Digest 1B6453892473A467D07372D45EB05ABC2031647A 52 0 52
String 5 Digest AC3478D69A3C81FA62E60F5C3696165A4E5E6AC4 53 0 53
String 6 Digest C1DFD96EEA8CC2B62785275BCA38AC261256E278 54 0 54
String 7 Digest 902BA3CDA1883801594B6E1B452790CC53948FDA 55 0 55
String 8 Digest FE5DBBCEA5CE7E2988B8C69BCFDFDE8904AABC1F 56 0 56
String 9 Digest 0ADE7C2CF97F75D009975F4D720D1FA6C19F4897 57 0 57
String : Digest 05A79F06CF3F67F726DAE68D18A2290F6C9A50C9 58 0 58
String ; Digest 2D14AB97CC3DC294C51C0D6814F4EA45F4B4E312 59 0 59
String < Digest C4DD3C8CDD8D7C95603DD67F1CD873D5F9148B29 60 0 60
String = Digest 21606782C65E44CAC7AFBB90977D8B6F82140E76 61 0 61
String > Digest 091385BE99B45F459A231582D583EC9F3FA3D194 62 0 62
String ? Digest 5BAB61EB53176449E25C2C82F172B82CB13FFB9D 63 0 63
String @ Digest 9A78211436F6D425EC38F5C4E02270801F3524F8 64 0 64
String A Digest 6DCD4CE23D88E2EE9568BA546C007C63D9131C1B 65 0 65
String B Digest AE4F281DF5A5D0FF3CAD6371F76D5C29B6D953EC 66 0 66
String C Digest 32096C2E0EFF33D844EE6D675407ACE18289357D 67 0 67
String D Digest 50C9E8D5FC98727B4BBC93CF5D64A68DB647F04F 68 0 68
String E Digest E0184ADEDF913B076626646D3F52C3B49C39AD6D 69 0 69
String F Digest E69F20E9F683920D3FB4329ABD951E878B1F9372 70 0 70
String G Digest A36A6718F54524D846894FB04B5B885B4E43E63B 71 0 71
String H Digest 7CF184F4C67AD58283ECB19349720B0CAE756829 72 0 72
String I Digest CA73AB65568CD125C2D27A22BBD9E863C10B675D 73 0 73
String J Digest 58668E7669FD564D99DB5D581FCDB6A5618440B5 74 0 74
String K Digest A7EE38BB7BE4FC44198CB2685D9601DCF2B9F569 75 0 75
String L Digest D160E0986ACA4714714A16F29EC605AF90BE704D 76 0 76
String M Digest C63AE6DD4FC9F9DDA66970E827D13F7C73FE841C 77 0 77
String N Digest B51A60734DA64BE0E618BACBEA2865A8A7DCD669 78 0 78
String O Digest 08A914CDE05039694EF0194D9EE79FF9A79DDE33 79 0 79
String P Digest 511993D3C99719E38A6779073019DACD7178DDB9 80 0 80
String Q Digest C3156E00D3C2588C639E0D3CF6821258B05761C7 81 0 81
String R Digest 06576556D1AD802F247CAD11AE748BE47B70CD9C 82 0 82
String S Digest 02AA629C8B16CD17A44F3A0EFEC2FEED43937642 83 0 83
String T Digest C2C53D66948214258A26CA9CA845D7AC0C17F8E7 84 0 84
String U Digest B2C7C0CAA10A0CCA5EA7D69E54018AE0C0389DD6 85 0 85
String V Digest C9EE5681D3C59F7541C27A38B67EDF46259E187B 86 0 86
String W Digest E2415CB7F63DF0C9DE23362326AD3C37A9ADFC96 87 0 87
String X Digest C032ADC1FF629C9B66F22749AD667E6BEADF144B 88 0 88
String Y Digest 23EB4D3F4155395A74E9D534F97FF4C1908F5AAC 89 0 89
String Z Digest 909F99A779ADB66A76FC53AB56C7DD1CAF35D0FD 90 0 90
String [ Digest 1E5C2F367F02E47A8C160CDA1CD9D91DECBAC441 91 0 91
String \ Digest 08534F33C201A45017B502E90A800F1B708EBCB3 92 0 92
String ] Digest 4FF447B8BF42CA51FA6FB287BED8D40F49BE58F1 93 0 93
String ^ Digest 5E6F80A34A9798CAFC6A5DB96CC57BA4C4DB59C2 94 0 94
String _ Digest 53A0ACFAD59379B3E050338BF9F23CFC172EE787 95 0 95
String ` Digest 7E15BB5C01B7DD56499E37C634CF791D3A519AEE 96 0 96
String a Digest 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8

Found string a with Digest 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8
Digest_string 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8

In one embodiment, using a modulus scan can reduce the total number of 
iterations to 9 or 10. For example, 97 mod 10 = 7, so every 10th number plus 7 is hashed to 
verify if the hash matches the original string hashed by a signature. If a match occurs, the 
string has been found. 
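The reduction in iterations for the string "a" can be checked with a short sketch that counts how many single-byte candidates are hashed in each case. The function `scan_bytes` is hypothetical; the target digest is the SHA-1 of "a" given in the text.

```python
import hashlib

TARGET = "86f7e437faa5a7fce15d1ddcb9eaeaea377667b8"  # SHA-1 of "a", as given in the text


def scan_bytes(values):
    """Hash each single-byte candidate in turn; return (matching value, candidates tried)."""
    tried = 0
    for value in values:
        tried += 1
        if hashlib.sha1(bytes([value])).hexdigest() == TARGET:
            return value, tried
    return None, tried


full_scan = scan_bytes(range(0, 256))         # every value from 0 upward
modulus_scan = scan_bytes(range(7, 256, 10))  # only values with value mod 10 == 7
```

Under these assumptions the full scan hashes 98 candidates (0 through 97) before finding the value 97, while the modulus scan hashes only 10 (7, 17, ..., 97).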

The digest of string a is 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8
Digest String: 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8: string a!

String Digest 5D1BB7E9DDA1EE8896BE5B7E34A85EE16452A7B4 7 0 7
String Digest A8ABD012EB59B862BF9BC1EA443D2F35A1A2E222 17 0 17
String Digest 27F57CB359A8F86ACF4AF811C47A6380B4BB4209 27 0 27
String % Digest 4345CB1FA27885A8FBFE7C0C830A592CC76A552B 37 0 37
String / Digest 42099B4AF021E53FD8FD4E056C2568D7C2E3FFA8 47 0 47
String 9 Digest 0ADE7C2CF97F75D009975F4D720D1FA6C19F4897 57 0 57
String C Digest 32096C2E0EFF33D844EE6D675407ACE18289357D 67 0 67
String M Digest C63AE6DD4FC9F9DDA66970E827D13F7C73FE841C 77 0 77
String W Digest E2415CB7F63DF0C9DE23362326AD3C37A9ADFC96 87 0 87
String a Digest 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8 97 0 97

Found string a with Digest 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8
Digest_string 86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8.

The commented source code zsha_str.c, included in the attached Appendix 2,
comprises a novel modification of the public domain program sha1.c, with the addition of an
SHA iterator iterate function and a modification of the main function for strings.

In one embodiment, where a file is small, one can append the hash to a filename. For
example,

test.xyz with contents of "a" (as in the above example) becomes:

test.xyz.xz.sha.86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8.xz,

and the contents of the file (10 to 100 bytes) can be zero instead of 1. This enables
very small files, which are usually stored as 4k blocks, to take advantage of 255 character
filenames and hash filename appendages to encode the data as length 0. Note here that xyz
and xz may be additional descriptors for the file. For example, test.xyz.xz may represent
test.archive.01.

During validation, a scan would look for the filename, as well as a compression
method (here as HASHTYPE), and a checksum value (here as HASHDIGEST). Other
descriptors may be appended to the filename as well. For example, the modified filename
could be represented as:

[FILENAME].xz.[HASHTYPE].[HASHDIGEST].xz, or
[FILENAME].xz.[HASHTYPE].[HASHDIGEST].[MODULUS].[H^
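The first filename layout above may be built and parsed as in the following sketch. The function names `encode_name` and `decode_name` are hypothetical, and the 255 character limit is taken from the preceding discussion of filename length.

```python
def encode_name(filename: str, hash_type: str, digest: str) -> str:
    """Build [FILENAME].xz.[HASHTYPE].[HASHDIGEST].xz, rejecting names over 255 characters."""
    encoded = f"{filename}.xz.{hash_type}.{digest}.xz"
    if len(encoded) > 255:
        raise ValueError("encoded filename exceeds 255 characters")
    return encoded


def decode_name(encoded: str):
    """Recover (filename, hash type, digest) from an encoded filename."""
    stem, hash_type, digest, suffix = encoded.rsplit(".", 3)
    if suffix != "xz" or not stem.endswith(".xz"):
        raise ValueError("not a hash-encoded filename")
    return stem[:-len(".xz")], hash_type, digest


name = encode_name("test.xyz", "sha", "86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8")
```

Splitting from the right keeps any dots in the original filename intact, so "test.xyz" is recovered unchanged.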

The ActivePerl 5.6 sha6b.pl file included in Appendix 2 adds the XML and DTD
header information, as well as directory recursion, command options, and floor and ceiling
calculation. It also includes the modulus calculation, and has extended options to pick
which things to run. The Perl script further demonstrates the utility and usefulness of XML
signature beyond the W3C. The Perl files are dependent on some known Windows ActivePerl
modules. In addition, the compress.xml files can be zipped, decreasing the size by about 3 to
1 over the XML output. Also included in Appendix 2 are the palindrome.txt.xml and
compress_modulus2.xml files.

MARKUP COMPRESSION LAYOUT 

Compression markup files can be XML, SGML, or binary with certain markup tags. 
A markup language can be used along with a document type definition ("DTD") to define the 
compression file. A DTD can be used to validate that the compression file is correct and has
not been damaged, or that it contains the data needed to allow decompression. An example of a
DTD file follows.






<tllmmm>tlli/Zllmamm> — su twUtte i«tt 

<biock no-*l*> . — lb* block tag la tba ftla bloak 1 witb tt 

<bIoctalaa>20OOO</hloclraUa> — tba blooaaiaa tag laaotaa i _ _ 

<fa IoekUogth>gOOOO</hIociaaWh> — tba blocbloaatb teotaa tba Laoctb of tba 
«/pflOdad> , — tba pa ( Min g tag daao t aa abatba* tba faloab im 



<aba-nVaXM>«2ted03e7 6£2o69«6 *14b999b (1X12333) Ma47dM</aba-r 
ambaab of tba -ta aa aao ft bytao of bloob 1. 




<aba>aIUlaM 47O6U0a balo2371 7U0c2«c 9ottda§4</aba> — tba aba tag la 



<crts>125«371»0000(X/crc> » fte cm taj If tbt ex of tat miodb. 
</blodt oo»*l»> — tti oaf of block l* 
of bloob a. 

<b liir klaagtt>10006</biocldoagtb> 




4706tl6aba3oasn7B50e3fe»cd0tt9d</ab/ 
MTQOcOl ffSiofdl barmcS ttolact* baflaoo7<t/afao> 
>12»i37129213123321«/e 
</block no»*2*>. 
<block ao«*3»> 




4706tlaoba3aasn7t30oaaoMOoa^/abap-raa«rao> • 
<aha> MTJScOl UUrfOX baaTTlcS f9olacf 4 baflocc7</aha> 
<crol»«37U9</cre> 
«/block ao-"3*> 

<block oo»*4*> * * 

«bloc«alf a»30000< / blocka iaa* . 

a<yo<xmocaaapgtix> 
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>2000O</blockalxa> 
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<tolock aow©»> 
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<aba> ZdOOfcOl f faaaadl batfTOcS 99clact4 baflacc7</aba> 
<croima344ia9</ezo 
« /block ao»*S*> 
<bloek ao-»7»> 



«bloctlao g cb>2S3</ bi nc»l an g tb> 
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<tba> OOOOOcOl ttUcttl ba*77lc3 99dac64 b*flacc7</aba> 
<aeol2M3312»«/cre> 
< /block no»*7 - > 

</ftla> — tba flla tag la tba and c 



MESSAGE DIGEST PROCESSOR 

In one embodiment, a message digest processor is preferably adapted to perform the
systems and methods described above. Such a processor can be a custom microprocessor
dedicated to finding blocks of data associated with one or more checksum values. It is
related to data compression with multiple message digests. It can speed up digest
compression thousands, millions, or billions of times and make digest checksum compression
feasible. In one embodiment, a digest processor could have any number of processors, e.g.,
hundreds to thousands of processor cores and thousands of processors, depending on
configuration and transistor size, and other features including memory core size and fixed
programming or hard wired programming. In one embodiment, a message digest processor
can be one processor.

Fig. 4 is a diagram of a message digest processor. Shown in Fig. 4 is a 12-core
message digest processor 402. Said processor comprises 12 parallel processors; however,
this is exemplary only, and any number of processors may be used. Also shown in Fig. 4 are
processor die 404, coordinator CPU 406, chip package 408, and parallel digest cores
410a-410n. Elements of these components are described below. Exemplary features of a
Message Digest Processor (MDP):

Allows a digest to be authenticated rapidly in parallel. 

An MDP can validate whether a given checksum is unique or locate a binary number
associated with a given checksum.

A Message Digest Processor can exist as a computer coprocessor that assists with
verifying and validating signatures against a range of numbers to find the original binary
input.

Multiple Message Digest Processor chips can be chained together in parallel.

A Message Digest Processor can have a master processor 406 that coordinates
execution and determines whether execution halts or continues when a binary signature
number is found.

A Message Digest Processor can be termed a microcore architecture. A microcore
processor consists of hundreds or thousands of small processor cores on a single processor
chip die. A large core processor has fewer than a hundred cores.

A transistor chip die is divided into parallel processing cores 410a-410n with a 
minimum size. Each core has a scratchpad or buffer memory and program memory to test a 
binary number for a digest or digests against a given digest or modulus. 

Processor instructions consist of a signature group bundle, a modulus, and a byte size.
Processor instructions can also consist of program code to process a digital signature. The
instructions are VLIW (very long instruction word) instructions and can be several hundred
bytes long. Instructions can have variable-length data. Instructions can consist of data and
instruction signature code.

The processor can have multiple dedicated processor cores per processor
(SIMD, Single Instruction Multiple Data).

The processor can find a group of n bytes that corresponds with a given signature or
modulus or checksum or a group of each.

Processor instructions can have groups of digital signatures and modulus pairs that
compose a checksum associated with a binary number of a specified byte size, used to find
the binary number that matches the checksum group.

Independent Parallel processing allows each unit to run independently. 

The checksum or signature code can be distributed to the processors. 

Each processor can halt the processing if a binary number is found that corresponds
with a given signature or modulus, or continue if there are any collisions.

Each processor core uses a minimal number of transistors for maximal processing. The
68000 has 68,000+ transistors and would be an ideal size for a core in a multicore signature
processor.

Each processor core has a 512-byte (or larger or smaller) big integer
calculation unit or memory buffer to allow numbers to be tested against a checksum or
signature in parallel and iteratively to find a matching number with a set group of signatures.

Each processor has programming sufficient to calculate multiple digests (SHA, MD5,
RIPE, etc.) concurrently and store the results in registers for comparison with a given
checksum group.

Each processor core has hash registers associated with the different digests (SHA
register, MD5 register, RIPE register) as embedded code or loadable instruction code. The
instruction can sit in a buffer and be distributed to the cores in a processor.

The Digest Instructions for a given Digest can be hard wired on the chip or 
programmed in a buffer. 

For example, if a processor has a budget of 300 million transistors you could pack in
1000 or more processors for a core size of 300,000 transistors per processor core. Each
processor core contains registers and large integer support of 512 bytes or larger and multiple
message digest registers and program memory. A processor could even have a million cores.

In one embodiment, a processor instruction consists of any number of digests
(checksum values), a modulus number, a modulus remainder, a modulus exponent, collision
numbers if there are any for a given group of digests, and checksum instruction code for the
different digests. These instructions can be loaded into core memory at run time or from a
central flash chip which can update the message digest programming and distribute it to the
cores. Each core is assigned an iteration number to multiply the digest block by the core
number. As described above, if there are 1000 cores then an iteration could be described as n
* the modulus number * the core number + the biginteger register number. The remainder is
loaded into a 512-byte big integer register. If there is an exponent, the modulus number is
raised to the power of the exponent and added to the remainder in the big integer register.
Each iteration the processor increments the 512-byte big integer by ((processornumber +
startingnumber) * the modulus number + (the register number or the remainder)). The
processor runs any number of checksum tests on the biginteger (which may be any size, but
here is 512 bytes or larger), and tests to see if there is a match with a given package of
checksums (described above).
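The candidate computation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the processor's actual instruction format: the names `DigestInstruction`, `candidate`, and `matches` are hypothetical, the block is read as a big-endian big integer, and SHA-1 and MD5 from the standard `hashlib` module stand in for the digest package.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class DigestInstruction:
    """Hypothetical bundle mirroring the instruction fields named in the text."""
    digests: dict            # digest name -> expected hex value, e.g. {"sha1": "..."}
    modulus: int
    remainder: int
    exponent: int = 0        # 0 means no exponent was supplied
    block_size: int = 512    # bytes held in the big integer register

def candidate(instr, core, iteration, num_cores):
    """Number tested by one core on one iteration (iterations start at 0)."""
    n = core + iteration * num_cores                      # this core's multiplier
    base = instr.modulus ** instr.exponent if instr.exponent else 0
    return base + n * instr.modulus + instr.remainder

def matches(instr, value):
    """True only if every digest in the package agrees with the candidate."""
    data = value.to_bytes(instr.block_size, "big")        # assumed big-endian
    return all(hashlib.new(name, data).hexdigest() == want
               for name, want in instr.digests.items())
```

Requiring every digest in the package to agree is what gives the mutual exclusion between digests: a candidate that collides on one digest is still rejected by the others.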

For example, if processor core 12 finds a 512-byte integer that matches a given
message digest or group, then it can return an interrupt to a watch-keeper processor. A
watch-keeper processor determines if there are any possible collisions and, if there are no
other collisions, halts the processors and returns the big integer block of bytes that is in
core 12. Having more than one digest allows for mutual exclusion. Having a 512-byte block
with 3 or 4 digests of 20 bytes and a modulus remainder and exponent allows for parallel and
distributed searching of a block matching the digests. A modulus can also be chosen that
produces a remainder with the least number of digits.

A parallel modulus scan can be demonstrated by an iteration table from a group of n
processors * n cpu cores per processor. To clarify, each candidate should be the cpu
iteration result * modulus + remainder (e.g., a number's hash or digest). It could also be the
modulus raised to a power + the modulus * n cpu iteration result + the remainder, hashed.
This is related to the Chinese Remainder Theorem, which deals with questions such as: there
is a number n whose remainder divided by 12 is 3 and divided by 17 is 6 and divided by 8 is
2, what is the number? The Chinese Remainder Theorem will put this result in terms of an
equation n * modulus + the remainder (i.e., nu + remainder where n is the modulus and u is
an unknown number). A variant of this form is n^z + n * u + remainder = some number,
which characterizes the number with a logarithm and exponent, where n is the modulus and
u and z are some numbers. An example is 45 modulus 2, which is remainder 1. The number
2^5 = 32. 2 * 6 = 12. So (2^5) + (2 * 6) + 1 = 45. The number 47 mod 3 = 2. 3^3 = 27.
3 * 6 = 18. So (3^3) + (3 * 6) + 2 = 47.
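The decomposition n^z + n * u + remainder can be checked with a short Python sketch. The function names are hypothetical, and choosing z as the largest power that still fits is an assumption (the text does not say how z is chosen). The sketch also brute-forces the congruence question quoted above. One caveat: as stated, that three-modulus question has no answer, because any number with remainder 3 on division by 12 is odd, while any number with remainder 2 on division by 8 is even. Dropping the third condition gives 159.

```python
def decompose(number, modulus):
    """Write number as modulus**z + modulus*u + r, where r = number % modulus.

    Assumes modulus >= 2 and number >= modulus; z is the largest usable power.
    """
    if modulus < 2 or number < modulus:
        raise ValueError("need modulus >= 2 and number >= modulus")
    r = number % modulus
    rest = number - r                       # a multiple of the modulus
    z = 1
    while modulus ** (z + 1) <= rest:       # largest power not exceeding rest
        z += 1
    u = (rest - modulus ** z) // modulus    # exact, since both terms are multiples
    return z, u, r

def solve_congruences(pairs, limit):
    """Smallest n in [0, limit) with n % m == r for every (m, r), else None."""
    for n in range(limit):
        if all(n % m == r for m, r in pairs):
            return n
    return None

print(decompose(45, 2))   # (5, 6, 1): 2**5 + 2*6 + 1 = 45
print(decompose(47, 3))   # (3, 6, 2): 3**3 + 3*6 + 2 = 47
```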

For example, take the number of processors (as an example, 10 processors with 1 core
per processor). For iteration 1 each processor starts with a core number 1 through 10. Then
the total number of processors is added to the starting number of iteration 1 for iteration 2,
and so on up to some number n. Below is a table for a 10-core processor with the core
iterand of 10.

In iteration 1 each core or processor is assigned an initial number from 1 to 10.
Shown in iteration 2 is the processor starting number value + the number of processors and
cores (10 processors or cores). Iterations 3 or more show the previous value plus the core or
processor total.

This is an iteration table for 10 microprocessors. The resulting number is multiplied
by the modulus and added to the remainder in parallel, and tested in parallel to find a
matching hash. The numbers 1 through 10 can also be assigned as processor IDs or starting
values or numbers. The processors can also be assigned 0 to 9.



Iteration table cpu n (1 to 10)

 1  2  3  4  5  6  7  8  9 10 : Iteration 1: each core is assigned a number
11 12 13 14 15 16 17 18 19 20 : Iteration 2 = starting value n + 10 (cpus)
21 22 23 24 25 26 27 28 29 30 : Iteration 3 = n + 10, so 11 + 10 = 21
31 32 33 34 35 36 37 38 39 40 : Iteration 4
41 42 43 44 45 46 47 48 49 50 : Iteration 5
51 52 53 54 55 56 57 58 59 60 : Iteration 6
61 62 63 64 65 66 67 68 69 70 : Iteration 7



Iteration table cpu n (0 to 9)

For iteration 1 each processor is assigned an initial number.

Processor 1 is assigned a number ID of 0
Processor 2 is assigned a number ID of 1
Processor 3 is assigned a number ID of 2
Processor 4 is assigned a number ID of 3
Processor 5 is assigned a number ID of 4
Processor 6 is assigned a number ID of 5
Processor 7 is assigned a number ID of 6
Processor 8 is assigned a number ID of 7
Processor 9 is assigned a number ID of 8
Processor 10 is assigned a number ID of 9

 0  1  2  3  4  5  6  7  8  9 : Iteration 1: assign each core a number (0 - 9)
10 11 12 13 14 15 16 17 18 19 : Iteration 2 = starting value n + 10 (cpus)
20 21 22 23 24 25 26 27 28 29 : Iteration 3 = previous value n + 10, so for cpu 0 the
                                iteration 3 value is 10 + 10 = 20
30 31 32 33 34 35 36 37 38 39 : Iteration 4
40 41 42 43 44 45 46 47 48 49 : Iteration 5
50 51 52 53 54 55 56 57 58 59 : Iteration 6
60 61 62 63 64 65 66 67 68 69 : Iteration 7
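The iteration tables above follow a single rule: the value for a core on a given iteration is its starting ID plus the iteration index times the number of cores. A minimal Python sketch of that rule follows; the function name `iteration_table` and its parameters are illustrative, not taken from the source.

```python
def iteration_table(num_cores, iterations, first_id=0):
    """Rows of multipliers: row i holds the values each core tests on iteration i + 1."""
    return [
        [first_id + core + it * num_cores for core in range(num_cores)]
        for it in range(iterations)
    ]

# Reproduce the 0-to-9 table for 10 cores and 7 iterations.
for number, row in enumerate(iteration_table(10, 7), start=1):
    print(" ".join(f"{value:2d}" for value in row), ": Iteration", number)
```

Passing `first_id=1` reproduces the 1-to-10 variant of the table.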

Example 1: A number divided by (modulus) 12 has a remainder of 2 and has a signature
of x; find the number.

In this iteration table each of the 10 processors splits up the numbers and checks to
see if there is a given number u * the modulus + the remainder 2 whose signature or
checksum (where a checksum can be a group of signatures or moduluses or CRCs) is x. It
can do this in parallel and halt execution, or continue if there is a collision, where one or
more numbers share a signature or signatures and there is a collision number that is being
searched for.

(0*12+2=2) (1*12+2=14) (2*12+2=26) (3*12+2=38) (4*12+2=50) (5*12+2=62)
(6*12+2=74) (7*12+2=86) (8*12+2=98) (9*12+2=110) : Iteration 1

(10*12+2=122) (11*12+2=134) (12*12+2) (13*12+2) (14*12+2) (15*12+2)
(16*12+2) (17*12+2) (18*12+2) (19*12+2) : Iteration 2 = starting value n + 10 (cpus)

(20*12+2) (21*12+2) (22*12+2) (23*12+2) (24*12+2) (25*12+2) (26*12+2)
(27*12+2) (28*12+2) (29*12+2) : Iteration 3 = previous n + 10, so 10 + 10 = 20

It continues adding and hashing or running the checksum tests on the resulting binary
number in each processor until it finds a matching checksum or signature or processing is
halted.

For (0*12+2=2) the result 2 is run through a checksum, and if the output checksum
matches a given checksum and no collision numbers exist then the number is returned.
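Example 1 can be simulated sequentially in Python. This is only a sketch: the cores are visited in a loop rather than run concurrently, SHA-1 stands in for the signature x, and hashing each number as minimal-length big-endian bytes is an assumption made here for illustration. The names `sha1_of` and `modulus_scan` are hypothetical.

```python
import hashlib

def sha1_of(number):
    # Assumption: a number is hashed as its minimal-length big-endian byte string.
    length = max(1, (number.bit_length() + 7) // 8)
    return hashlib.sha1(number.to_bytes(length, "big")).hexdigest()

def modulus_scan(target, modulus, remainder, num_cores, max_iterations):
    """Return (core, iteration, number) for the first match, or None."""
    for iteration in range(max_iterations):
        # On real hardware the cores would test these candidates at the same time.
        for core in range(num_cores):
            u = core + iteration * num_cores
            number = u * modulus + remainder
            if sha1_of(number) == target:
                return core, iteration, number
    return None

# 134 = 11 * 12 + 2, so core 1 finds it on the second iteration (index 1).
print(modulus_scan(sha1_of(134), 12, 2, num_cores=10, max_iterations=5))
```

Only every twelfth integer is hashed, which is the saving the modulus and remainder provide over a linear scan.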

This is a table of 4 processor cores with starting values 0 to 3. Each core
increments its previous value by 4 for each iteration.

Iteration table cpu n (0 to 3)

0 1  2  3 : Iteration 1: each value is multiplied by the modulus, added to the
            remainder, hashed, and then tested.
4 5  6  7 : Iteration 2
8 9 10 11 : Iteration 3

THE COMPRESSION CHECKSUM ARCHIVE DTD 

There are various XML DTDs to record signatures and hashes. This XML DTD is
designed to express the markup required for generating XML compression files with message
digests and checksums. A package of signatures and constrictors provides a method of
finding by brute force a matching block of data associated with a checksum. A compression
XML DTD uses the digital signatures, hashes, and modulus to express a unique identifier
that can be used to reverse scan for a block of data with the signature attributes. A normal
signature message digest DTD is fairly general. The advantage of this DTD is that it is
flexible, allowing a program to choose the XML format as attributes or tags. The XML
markup consists of hash packages that correspond to files within an archive. The archive is
an XML tree that contains the files, directories, and file structure, and the associated
signature and file metadata. This has two uses: one is to regenerate the data, and the
other is to allow for security by providing an enhanced message digest or signature file for
verifying the content of data and verifying that the data has not been altered or changed.

The XML compression archive DTD allows for the content of the XML archive to be
verified and checked. A file can be tested against a DTD to ensure that the structure and
content are valid.

XML allows the structure of a compression file to be seen. The compression archive
data is not normally seen in regular binary compressed files. Having a file in XML allows for
the various structures that denote the file attributes to be seen. The disadvantage with XML
is that there is some overhead in the expression of the data in tags versus binary expression of
the data. However, the compress.xml files can be zipped or compressed with a regular file
compressor to reduce their size.

The XML compression archive can be used as a metabase to record system file
changes and to crawl the directory tree of a computer file system and provide much greater
security and integrity checking.

The following is a demonstrative XML DTD of a signature archive for compression
and data integrity checking.

TAGS AND ATTRIBUTES AND EXAMPLES 

The markup of an archive file can be expressed as tags or as attributes.

The file out is an example of the tagged markup of an XML archive. Each property of
a computer file, from the file name and file size or file length to the corresponding file's
digital signatures or message digests or checksums, can be encoded as a tag. In this instance
the SHA tag represents the SHA signature of the file and SHA_REVERSE represents the
reverse of the file run through the SHA. In XML compression archives the various tags and
their meaning must be well defined. The SHAMODULUS tag represents the file and its
associated content as a big integer of thousands of bytes, with the modulus remainder of the
SHA digest. The MODULUSEXPONENT represents the power that a modulus can be raised
to, using logarithms to create an exponent. A logarithm of the file bytes converted to a big
integer, with the modulus used as the base number, creates an exponent. This exponent,
captured within the MODULUSEXPONENT tag, represents the power the modulus can be
raised to in order to calculate the file number. To find the original big integer value of the
file or block, the modulus is raised to the exponent, a multiple of the modulus (n * modulus,
with n incremented) plus the remainder is added, and the result is run through the digest. If
there is no exponent, then the candidate is some n * modulus plus the modulus remainder.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<FILE>
<NAME>out</NAME>
<SIZE>3961651</SIZE>
<SHA>54173139a00072fdfa3988f1b8cf0e4e9baf31ee</SHA>
<SHA_REVERSE>55639652390e4ae6e66b23ed68afcdb83235577b</SHA_REVERSE>
<MD5>f11ef3dfe3815469a41d8ec29157d32c</MD5>
<MD4>e0f9a130b5ea1256d8c75126d26a6179</MD4>
<MD2>26080751c1200a69978fdad60f886f1f</MD2>
<FLOOR/>
<CEIL/>
<SHAMODULUS>31222</SHAMODULUS>
<MODULUSEXPONENT>222</MODULUSEXPONENT>
<COLLISION_NUMBER>12</COLLISION_NUMBER>
</FILE>
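A tagged entry of this shape can be produced with the Python standard library, as in the following sketch. The function name `tagged_file_entry` is hypothetical, only SHA-1 and MD5 are emitted (MD4 and MD2 are not guaranteed to be available in `hashlib`), and "reverse" is assumed to mean the input bytes in reverse order.

```python
import hashlib
import xml.etree.ElementTree as ET

def tagged_file_entry(name, data):
    """Build a FILE element whose properties are all expressed as child tags."""
    entry = ET.Element("FILE")
    ET.SubElement(entry, "NAME").text = name
    ET.SubElement(entry, "SIZE").text = str(len(data))
    ET.SubElement(entry, "SHA").text = hashlib.sha1(data).hexdigest()
    # Assumption: the reverse digest hashes the bytes in reverse order.
    ET.SubElement(entry, "SHA_REVERSE").text = hashlib.sha1(data[::-1]).hexdigest()
    ET.SubElement(entry, "MD5").text = hashlib.md5(data).hexdigest()
    return entry

xml_text = ET.tostring(tagged_file_entry("out", b"abc"), encoding="unicode")
print(xml_text)
```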

An attribute archive encodes the XML data in tags with attributes that are more
abstract. The file can be encoded entirely with tags or attributes or a composite of both tags
and attributes.

In this example the filename is palindrome.txt, and the md tag represents the message
digest, with the attribute t representing the digest type, which is the Secure Hash Algorithm
(SHA). The mdr tag represents the file data run in reverse through the SHA hash. The md tag
with the t attribute of MD5 would represent the MD5 message digest. The floor or ceiling tag
would represent some factorial that is above or below the file big integer number (which
represents the numeric value or product of the bytes of a file).

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
<FILE>
<NAME>palindrome.txt</NAME>
<SIZE>12</SIZE>
<md t="SHA">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</md>
<mdr t="SHA">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</mdr>
<md t="RIPE160">b7bda536ef319629b87b1a564678907834bdabae</md>
<mdr t="RIPE160">b7bda536ef319629b87b1a564678907834bdabae</mdr>
<md t="MD5">86e444818581edddef062ad4ddcd00dd</md>
<md t="MD4">ae93876f99f0013b969313ee5483c051</md>
<md t="MD2">141ecbaa14a70771003b4b6973522c14</md>
<MILLIONMODULUS>+98097</MILLIONMODULUS>
<FLOOR>+54</FLOOR>
<CEIL>+55</CEIL>
</FILE>
</ARCH>
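The attribute form, and the palindrome property that makes the md and mdr values above identical, can be illustrated with a short sketch. The function name `attribute_file_entry` is hypothetical, the t attribute here carries the upper-cased `hashlib` algorithm name rather than the labels used in the listing, and "reverse" is again assumed to mean byte-reversed input.

```python
import hashlib
import xml.etree.ElementTree as ET

def attribute_file_entry(name, data, digest_types=("sha1", "md5")):
    """Build a FILE element that carries the digest type in a t attribute."""
    entry = ET.Element("FILE")
    ET.SubElement(entry, "NAME").text = name
    ET.SubElement(entry, "SIZE").text = str(len(data))
    for digest_type in digest_types:
        md = ET.SubElement(entry, "md", {"t": digest_type.upper()})
        md.text = hashlib.new(digest_type, data).hexdigest()
        mdr = ET.SubElement(entry, "mdr", {"t": digest_type.upper()})
        mdr.text = hashlib.new(digest_type, data[::-1]).hexdigest()
    return entry

# A palindrome reads the same reversed, so each md equals its mdr.
entry = attribute_file_entry("palindrome.txt", b"step on no pets")
print(ET.tostring(entry, encoding="unicode"))
```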

The next section provides a list of various tags and their usage.

ELEMENT ARCH
Model

<!ELEMENT ARCH (DIRECTORY?, FILE+)>
<!ATTLIST ARCH name CDATA #IMPLIED>

Description

The element Arch represents the start of a compression archive.

Usage

ELEMENT DIRECTORY 
Model 

<!ELEMENT DIRECTORY (NAME?, PASSWORD?, FILE+)>
<!ATTLIST DIRECTORY N CDATA #IMPLIED>
<!ATTLIST DIRECTORY S CDATA #IMPLIED>

Description

The element Directory represents the start of directory file attributes of an archive. A
directory consists of one or more file elements or tags. A directory has a name or password
tag that is optional. The directory name can also be expressed in the n attribute. The s
attribute can also denote size or the number of directory files.

Usage 

ELEMENT DIR 
Model 

<!ELEMENT DIR (NAME?, PASSWORD?, FILE+)>
<!ATTLIST DIR N CDATA #IMPLIED>
<!ATTLIST DIR S CDATA #IMPLIED>

Description

The element Dir represents the start of directory file attributes of an archive. A
directory consists of one or more file elements or tags. A directory has a name or password
tag that is optional. The directory name can also be expressed in the n attribute. The s
attribute can also denote size or the number of directory files.

Usage

<DIR n="filename">
ELEMENT FILE
Model

<!ELEMENT FILE (NAME?, SIZE?, TIME?, BLOCK*, OS?, BYTEORDER?,
PASSWORD?, CRC?, MD?, MDR?, MDC?, MDX?, SHA, SHA_REVERSE?, MD2?,
MD2_REVERSE?, MD3?, MD3_REVERSE?, MD4?, MD4_REVERSE?, MD5?,
MD5_REVERSE?, COLLISION_NUMBER?, FLOOR?, CEIL?)>
<!-- File Name attribute N -->
<!ATTLIST FILE N CDATA #IMPLIED>
<!-- File Size in Bytes attribute S -->
<!ATTLIST FILE S CDATA #IMPLIED>

Description

The element File represents the file content, including the digests, checksums, and
signature data of a compression archive. A file can have multiple properties within the
elements or tags, including filename, file length, and collision number or floor or ceiling for
which to scan for a message digest match. The floor or ceiling or modulus or logarithm
exponents represent constrictors within which to scan for a block of data matching the
package of digests or checksums associated with a file. A file can have a filename tag or
operating system tag, password, message digest tag, collision number, floor or ceiling.
These file tags can also be expressed as attributes or blocks of delineated text. Additional
attributes or tags can also be added. The byte order tag or possible attribute specifies the file
or digital signature byte order, whether big-endian or little-endian.

Usage 

ELEMENT NAME
Model

<!ELEMENT NAME (#PCDATA)>

Description

The element Name represents the name of a file within the compression
archive.

Usage 

ELEMENT OS
Model

<!ELEMENT OS (#PCDATA)>

Description

The element OS represents the operating system parameter tag of a file.

Usage 

ELEMENT BYTEORDER
Model

<!ELEMENT BYTEORDER (#PCDATA)>

Description

The element BYTEORDER represents the byte order parameter tag of a file. This
specifies whether the bytes are in little-endian or big-endian format. It can also specify other
byte orders of a digital signature's input data or file input so that the signature will be
reproducible on computers with different byte orders.

Usage

<BYTEORDER>little-endian</BYTEORDER>
<BYTEORDER>big-endian</BYTEORDER>

Or

<BYTEORDER>l</BYTEORDER>
<BYTEORDER>b</BYTEORDER>
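The byte order matters because the same bytes yield different big integers, and therefore different modulus remainders, depending on how they are read. A minimal Python sketch, assuming the tag values shown above map onto the byte orders accepted by `int.from_bytes`; the mapping table and function name are illustrative only.

```python
# Hypothetical mapping from BYTEORDER tag values to Python byte-order names.
BYTEORDER_NAMES = {
    "big-endian": "big", "b": "big",
    "little-endian": "little", "l": "little",
}

def block_to_int(data, byteorder_tag):
    """Convert a block of bytes to a big integer using the declared byte order."""
    return int.from_bytes(data, BYTEORDER_NAMES[byteorder_tag])

data = bytes([0x01, 0x02, 0x03])
print(block_to_int(data, "big-endian"))     # 0x010203 = 66051
print(block_to_int(data, "little-endian"))  # 0x030201 = 197121
```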
ELEMENT PASSWORD
Model

<!ELEMENT PASSWORD (#PCDATA)>

Description

The element Password represents the password protection of a file in the compression
archive. A password tag indicates that some seed has been mixed in with the digests to
encrypt or protect the data.

Usage 

ELEMENT BLOCK
Model

<!ELEMENT BLOCK (#PCDATA)>
<!ATTLIST BLOCK NUM CDATA #REQUIRED>
<!ATTLIST BLOCK LNG CDATA #REQUIRED>

Description

The element Block represents a block of a file. This allows for a file to be split into
multiple blocks of n bytes. The blocks can have digests or signatures. The various
signatures can be nested so that the blocks can be tested individually. A modulus scan allows
for iteration over a series of data to check for a block of data with an associated checksum or
signature. The entire file output can be tested to ensure that the signature matches the package
of signatures or checksums.
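Splitting a file into fixed-size blocks and recording a digest per block can be sketched as follows. The function name and dictionary keys are hypothetical (loosely following the block number and length attributes), SHA-1 stands in for the digest, and the final block is simply allowed to be shorter than the rest.

```python
import hashlib

def block_entries(data, block_size):
    """Return one record per block: number, length, and forward/reverse SHA-1."""
    entries = []
    for num, start in enumerate(range(0, len(data), block_size), start=1):
        block = data[start:start + block_size]
        entries.append({
            "num": num,                       # block number, starting at 1
            "lng": len(block),                # block length in bytes
            "sha": hashlib.sha1(block).hexdigest(),
            "sha_reverse": hashlib.sha1(block[::-1]).hexdigest(),
        })
    return entries

for entry in block_entries(b"a" * 25, 10):
    print(entry["num"], entry["lng"], entry["sha"])
```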

Usage

ELEMENT SIZE
Model

<!ELEMENT SIZE (#PCDATA)>

Description

The element Size represents the size of a file block or file in bytes.

Usage 

ELEMENT TIME
Model

<!ELEMENT TIME (#PCDATA)>

Description

The element Time represents the time a signature group was run on a file.

Usage

<TIME>April 11, 2005 12:33PM</TIME>

ELEMENT MD
Model

<!ELEMENT MD (#PCDATA)>
<!ATTLIST MD t CDATA #REQUIRED>
<!ATTLIST MD l CDATA #REQUIRED>

Description

The element MD represents the message digest of a file block or file. The t attribute
represents the digest type and the l attribute represents the digest length in bytes.

Usage

ELEMENT MDR
Model

<!ELEMENT MDR (#PCDATA)>
<!ATTLIST MDR t CDATA #REQUIRED>
<!ATTLIST MDR l CDATA #REQUIRED>

Description

The element MDR represents the message digest of the reverse of a file block or file.
Every hash or digital signature has a reverse transform, where the reverse of the input
generates a different hash. The extreme condition is a palindrome, where the signature or
digest is the same forwards or backwards. The t attribute represents the digest type and the l
attribute represents the digest length in bytes.

Usage 

ELEMENT MDX
Model

<!ELEMENT MDX (#PCDATA)>
<!ATTLIST MDX t CDATA #REQUIRED>
<!ATTLIST MDX l CDATA #REQUIRED>

Description

The element MDX represents the user-defined digest block. The t attribute represents the
digest type and the l attribute represents the digest length in bytes.

Usage 

ELEMENT MDC
Model

<!ELEMENT MDC (#PCDATA)>
<!ATTLIST MDC t CDATA #REQUIRED>
<!ATTLIST MDC l CDATA #REQUIRED>

Description

The element MDC represents a chopped hash or digital signature. Basically this
means that a signature such as SHA can hash a block of data or a file. The resulting 20-byte
hash or 40-character hex hash can be chopped to 2 bytes or as many bytes as required. So if
a checksum or digital signature has a length of 20 bytes and the MDC length is 2, then only 2
bytes or 4 hex characters are selected. A hex version of a 20-byte signature will be 40
characters long. The t attribute represents the type and the l attribute represents the length.

Usage

<mdc t="SHA" l="1">d2</mdc>
<mdc t="RIPE160" l="3">b7bda5</mdc>
<mdc t="MD2" l="2">141e</mdc>
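Chopping a digest amounts to keeping a fixed number of bytes of its hex form. The sketch below assumes the leading bytes are kept, which is consistent with the usage values above being prefixes of the full digests in the earlier palindrome example; the function name `chopped_digest` is hypothetical.

```python
import hashlib

def chopped_digest(data, digest_type, length):
    """Return the first `length` bytes of the digest as 2 * length hex characters."""
    full = hashlib.new(digest_type, data).hexdigest()
    return full[:2 * length]

print(chopped_digest(b"abc", "sha1", 2))   # first 2 bytes of the SHA-1 of "abc"
print(chopped_digest(b"abc", "md5", 1))    # first byte of the MD5 of "abc"
```

A shorter chopped digest takes less space in the archive but matches more candidate blocks, which is why the collision number is needed alongside it.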

ELEMENT SHA
Model

<!ELEMENT SHA (#PCDATA)>

Description

The element SHA represents a SHA 160-bit digital signature. Basically this means
that a signature such as SHA can hash a block of data or a file.



Usage 

ELEMENT SHA_REVERSE
Model

<!ELEMENT SHA_REVERSE (#PCDATA)>

Description

The element SHA_REVERSE represents a SHA 160-bit digital signature of the reversed
input. Basically this means that a signature such as SHA can hash the reverse of a block of
data or a file.

Usage

ELEMENT MD5
Model

<!ELEMENT MD5 (#PCDATA)>

Description

The element MD5 represents an MD5 (Message Digest 5) digital signature.
Basically this means that a signature such as MD5 can hash a block of data or a file and
encode the hash as hex within the MD5 open and close tags or element. The MDC can be
used to generate hash collisions or to create collision_number tags for small data.

Usage 

ELEMENT MD5_REVERSE
Model

<!ELEMENT MD5_REVERSE (#PCDATA)>

Description

The element MD5_REVERSE represents an MD5 digital signature with reverse input of the
file or file block. Basically this means that a signature such as MD5 can hash the reverse of
a block of data or a file and encode the hash as hex within the MD5_REVERSE open and
close tags or element.

Usage 
ELEMENT MD4
Model

<!ELEMENT MD4 (#PCDATA)>

Description

The element MD4 represents an MD4 digital signature. Basically this means that a
signature such as MD4 can hash a block of data or a file and encode the hash as hex within
the MD4 open and close tags or element.

Usage 

ELEMENT MD4_REVERSE
Model

<!ELEMENT MD4_REVERSE (#PCDATA)>

Description

The element MD4_REVERSE represents an MD4 digital signature with reverse input of the
file or file block. Basically this means that a signature such as MD4 can hash the reverse of
a block of data or a file and encode the hash as hex within the MD4_REVERSE open and
close tags or element.

Usage 

ELEMENT MD3
Model

<!ELEMENT MD3 (#PCDATA)>

Description

The element MD3 represents an MD3 digital signature. Basically this means that a
signature such as MD3 can hash a block of data or a file and encode the hash as hex within
the MD3 open and close tags or element.

Usage

ELEMENT MD3_REVERSE
Model

<!ELEMENT MD3_REVERSE (#PCDATA)>

Description

The element MD3_REVERSE represents an MD3 digital signature with reverse input of the
file or file block. Basically this means that a signature such as MD3 can hash the reverse of
a block of data or a file and encode the hash as hex within the MD3_REVERSE open and
close tags or element.

Usage 

ELEMENT MD2
Model

<!ELEMENT MD2 (#PCDATA)>

Description

The element MD2 represents an MD2 digital signature. Basically this means that a
signature such as MD2 can hash a block of data or a file and encode the hash as hex within
the MD2 open and close tags or element.

Usage 

ELEMENT MD2_REVERSE
Model

<!ELEMENT MD2_REVERSE (#PCDATA)>

Description

The element MD2_REVERSE represents an MD2 digital signature with reverse input of the
file or file block. Basically this means that a signature such as MD2 can hash the reverse of
a block of data or a file and encode the hash as hex within the MD2_REVERSE open and
close tags or element.

Usage 

ELEMENT COLLISION_NUMBER
Model

<!ELEMENT COLLISION_NUMBER (#PCDATA)>

Description

The element Collision_Number represents the collision number constraint over which
the modulus scan or skip scan should iterate. This is a constraint to differentiate possible
blocks of data that share the same signature or signature block. Groups of signatures also
ensure that data is differentiated over a set of iterations. A modulus scan will also
differentiate and reduce the number of iterations needed to find a block of data associated
with a package of checksums or signatures. However, on the occasion that a collision exists
(which it can in the case of the chop hash tag), the collision number tag will differentiate
collisions.
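One way to read the collision number is as a count of earlier candidates in the scan that share the same (possibly chopped) digest. The sketch below follows that reading. The zero-based counting convention, the function names, and the minimal-length big-endian hashing of each number are all assumptions made for illustration; the source does not define how the number is counted.

```python
import hashlib

def chopped_sha1(number, length):
    """First `length` bytes (as hex) of the SHA-1 of a number's big-endian bytes."""
    size = max(1, (number.bit_length() + 7) // 8)
    return hashlib.sha1(number.to_bytes(size, "big")).hexdigest()[:2 * length]

def collision_number_for(number, modulus, length):
    """Count earlier modulus-scan candidates that share the chopped digest."""
    remainder, target = number % modulus, chopped_sha1(number, length)
    return sum(1 for candidate in range(remainder, number, modulus)
               if chopped_sha1(candidate, length) == target)

def recover(target, modulus, remainder, length, collision_number):
    """Return the match reached after skipping `collision_number` earlier matches."""
    candidate, seen = remainder, 0
    while True:   # unbounded by design here; a real scan would also honor a ceiling
        if chopped_sha1(candidate, length) == target:
            if seen == collision_number:
                return candidate
            seen += 1
        candidate += modulus

number = 123456
k = collision_number_for(number, 12, 1)
print(k, recover(chopped_sha1(number, 1), 12, number % 12, 1, k))
```

With a one-byte chopped digest many candidates match, so the stored collision number is what singles out the original value.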

Usage 

ELEMENT FLOOR
Model

<!ELEMENT FLOOR (#PCDATA)>

Description

The element Floor represents the floor constraint over which the modulus scan or skip
scan should iterate. This is a constraint to minimize the number of iterations to find a block
of data associated with a package of checksums or signatures. Typically this represents a
bottom up scan.

Usage 

ELEMENT CEIL
Model

<!ELEMENT CEIL (#PCDATA)>

Description

The element Ceil represents the ceiling constraint over which the modulus scan or
skip scan should iterate. This is a constraint to minimize the number of iterations to find a
block of data associated with a package of checksums or signatures. Typically this represents
a bottom up scan but can also represent a top down or reverse scan with the ceiling as the
starting point in a search for matches to a given package of attributes for a file or block.
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Usage 

ELEMENT FACT
Model

<!ELEMENT FACT (#PCDATA)>
Description

The element Fact represents the Factorial constraint over which the modulus scan or
skip scan should iterate. This is a constraint to minimize the amount of iterations to find a
block of data associated with a package of checksums or signatures. Typically this represents
a bottom up scan but can also represent a top down or reverse scan with the ceiling as the
starting point in a search for matches to a given package of attributes for a file or block.

Usage 

ELEMENT SHAMODULUS
Model

<!ELEMENT SHAMODULUS (#PCDATA)>
Description

The element SHAMODULUS represents the modulus scan constraint over which the
modulus scan or skip scan should iterate. This is a constraint to minimize the amount of
iterations to find a block of data associated with a package of checksums or signatures.
Typically this represents a bottom up scan but can also represent a top down or reverse scan
with the ceiling as the starting point in a search for matches to a given package of attributes
for a file or block. A file and the associated bytes are converted to a big integer or number
and then the number takes the modulus of the SHA digest to generate the remainder. The
remainder of the modulus is captured within the SHAMODULUS open and closed tags to
provide for modulus scans or iteration.
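
The conversion of a file to a big integer and the computation of a remainder can be sketched as follows. This is a minimal sketch under stated assumptions: the bytes are interpreted as an unsigned big-endian integer (the text does not fix the byte order), the digest used as a modulus is likewise read as a big-endian integer, and the function names are hypothetical.

```python
import hashlib


def file_number(data: bytes) -> int:
    """Interpret the bytes of a file or block as one unsigned big-endian integer."""
    return int.from_bytes(data, "big")


def fixed_modulus(data: bytes, n: int) -> int:
    """Remainder of the file number modulo a fixed n (e.g. 100, 1000, 10**6)."""
    return file_number(data) % n


def digest_modulus(data: bytes, algorithm: str = "sha1") -> int:
    """Remainder of the file number modulo its own digest, read as an integer."""
    modulus = int.from_bytes(hashlib.new(algorithm, data).digest(), "big")
    if modulus == 0:
        raise ValueError("digest is zero; cannot be used as a modulus")
    return file_number(data) % modulus


remainder_100 = fixed_modulus(b"\x01\x00", 100)   # 256 % 100
sha_remainder = digest_modulus(b"x" * 64)
```

Note that for a block shorter than the digest the remainder modulo the digest is usually the file number itself, so a digest modulus only constrains blocks longer than the digest.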

Usage

ELEMENT MD5MODULUS
Model

<!ELEMENT MD5MODULUS (#PCDATA)>
Description

The element MD5MODULUS represents the modulus scan constraint over which the
modulus scan or skip scan should iterate. This is a constraint to minimize the amount of
iterations to find a block of data associated with a package of checksums or signatures.
Typically this represents a bottom up scan but can also represent a top down or reverse scan
with the ceiling as the starting point in a search for matches to a given package of attributes
for a file or block. A file and the associated bytes are converted to a big integer or number
and then the number takes the modulus of the MD5 digest to generate the remainder. The
remainder of the modulus is captured within the MD5MODULUS open and closed tags to
provide for modulus scans or iteration.

Usage 

ELEMENT MODULUS
Model

<!ELEMENT MODULUS (#PCDATA)>
<!ATTLIST MODULUS n CDATA #IMPLIED>
Description

The element MODULUS represents the modulus scan constraint over which the
modulus scan or skip scan should iterate. This is a constraint to minimize the amount of
iterations to find a block of data associated with a package of checksums or signatures.
Typically this represents a bottom up scan but can also represent a top down or reverse scan
with the ceiling as the starting point in a search for matches to a given package of attributes
for a file or block. A file and the associated bytes are converted to a big integer or number
and then the file number big integer takes the modulus of the n attribute to generate the
remainder. The remainder of the modulus is captured within the MODULUS open and
closed tags to provide for modulus scans or iteration. The modulus tag can have a default
value.

Usage

The tag captures the remainder of the file, converted to a big integer, modulo the
value given in the n attribute.
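
A modulus scan bounded by a floor and a ceiling can be sketched as follows. The sketch assumes big-endian conversion of blocks to integers, a known block size, and SHA-1 as the verifying signature. The function name modulus_scan and its parameters are hypothetical. Only numbers congruent to the stored remainder are hashed, so the number of digest computations drops by roughly a factor of n compared with a linear scan.

```python
import hashlib
from typing import Optional


def modulus_scan(remainder: int, n: int, size: int, sha1_hex: str,
                 floor: int = 0, ceil: Optional[int] = None) -> Optional[bytes]:
    """Bottom-up scan over candidates x with x % n == remainder, floor <= x <= ceil."""
    largest = 256 ** size - 1
    ceil = largest if ceil is None else min(ceil, largest)
    # First candidate at or above the floor that has the required remainder.
    start = floor + ((remainder - floor) % n)
    for candidate in range(start, ceil + 1, n):
        block = candidate.to_bytes(size, "big")
        if hashlib.sha1(block).hexdigest() == sha1_hex:
            return block
    return None


original = b"\x12\x34\x56"
value = int.from_bytes(original, "big")
found = modulus_scan(value % 1000, 1000, len(original),
                     hashlib.sha1(original).hexdigest())
```

A tighter floor or ceiling shortens the scan further, and a top down scan would iterate the same range in reverse.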

ELEMENT HUNDREDMODULUS
Model

<!ELEMENT HUNDREDMODULUS (#PCDATA)>
Description

The element HUNDREDMODULUS represents the modulus scan constraint over
which the modulus scan or skip scan should iterate. This is a constraint to minimize the
amount of iterations to find a block of data associated with a package of checksums or
signatures. Typically this represents a bottom up scan but can also represent a top down or
reverse scan with the ceiling as the starting point in a search for matches to a given package
of attributes for a file or block. A file and the associated bytes are converted to a big integer
or number and then the filenumber takes the modulus 100 to generate the remainder. The
remainder of the modulus is captured within the HUNDREDMODULUS open and closed
tags to provide for modulus scans or iteration.

Usage

ELEMENT THOUSANDMODULUS
Model

<!ELEMENT THOUSANDMODULUS (#PCDATA)>
Description

The element THOUSANDMODULUS represents the modulus scan constraint over
which the modulus scan or skip scan should iterate. This is a constraint to minimize the
amount of iterations to find a block of data associated with a package of checksums or
signatures. Typically this represents a bottom up scan but can also represent a top down or
reverse scan with the ceiling as the starting point in a search for matches to a given package
of attributes for a file or block. A file and the associated bytes are converted to a big integer
or number and then the filenumber takes the modulus 1000 to generate the remainder. The
remainder of the modulus is captured within the THOUSANDMODULUS open and closed
tags to provide for modulus scans or iteration.

Usage 

ELEMENT MILLIONMODULUS
Model

<!ELEMENT MILLIONMODULUS (#PCDATA)>
Description

The element MILLIONMODULUS represents the modulus scan constraint over
which the modulus scan or skip scan should iterate. This is a constraint to minimize the
amount of iterations to find a block of data associated with a package of checksums or
signatures. Typically this represents a bottom up scan but can also represent a top down or
reverse scan with the ceiling as the starting point in a search for matches to a given package
of attributes for a file or block. A file and the associated bytes are converted to a big integer
or number and then the filenumber takes the modulus of 1 million to generate the remainder.
The remainder of the modulus is captured within the MILLIONMODULUS open and closed
tags to provide for modulus scans or iteration.

Usage 

ELEMENT BILLIONMODULUS
Model

<!ELEMENT BILLIONMODULUS (#PCDATA)>
Description

The element BILLIONMODULUS represents the modulus scan constraint over which
the modulus scan or skip scan should iterate. This is a constraint to minimize the amount of
iterations to find a block of data associated with a package of checksums or signatures.
Typically this represents a bottom up scan but can also represent a top down or reverse scan
with the ceiling as the starting point in a search for matches to a given package of attributes
for a file or block. A file and the associated bytes are converted to a big integer or number
and then the filenumber takes the modulus of 1 billion to generate the remainder. The
remainder of the modulus is captured within the BILLIONMODULUS open and closed tags
to provide for modulus scans or iteration.

Usage 

ELEMENT DMOD
Model

<!ELEMENT DMOD (#PCDATA)>
<!ATTLIST BLOCK A CDATA #REQUIRED>
<!ATTLIST BLOCK B CDATA #REQUIRED>
Description

The element DMOD represents the double modulus scan constraint over which the
modulus scan or skip scan should iterate. This is a constraint to minimize the amount of
iterations to find a block of data associated with a package of checksums or signatures.
Typically this represents a bottom up scan but can also represent a top down or reverse scan
with the ceiling as the starting point in a search for matches to a given package of attributes
for a file or block. A file and the associated bytes are converted to a big integer or number
and then the binary filenumber takes a pair of modulus numbers to generate a remainder pair.
The remainder pair is captured within the DMOD open and closed tags to provide
for modulus scans or iteration. The pair of modulus numbers can be set by using them within
the attribute a and attribute b and then putting the remainders within the DMOD tag separated
with commas. If there are three arguments, the first number is the remainder for modulus a,
followed by a comma; the second number is an exponent to which modulus a can be raised,
followed by a comma; and the third number is the remainder for the second modulus. This
enables one modulus to be iterated over and a second modulus to be used as a test that is run
before any other signatures for checksum verification.

An example is the following numeric problem: there is a binary number x whose remainder
when divided by a million (modulus a) is 12333, whose remainder when divided by 123333
(modulus b) is 1232, whose SHA signature is 20 bytes, whose MD5 signature is 16 bytes, and
for which modulus a has an exponent power of 23; find the number. It also creates a modulus
pair for the file signature.

Usage

<DMOD a="1000000" b="123332">12333,23,1232</DMOD>
<DMOD>123332,123332</DMOD>
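
The double modulus test can be sketched as follows: the scan steps through numbers that have the right remainder for modulus a and discards, with one cheap integer operation, every candidate whose remainder for modulus b is wrong, before any digest is computed. This is an illustrative sketch only. The function name, the steps bound, and the reading of the exponent as a starting offset of mod_a raised to that power are assumptions.

```python
from typing import Iterator, Optional


def dmod_candidates(rem_a: int, mod_a: int, rem_b: int, mod_b: int,
                    steps: int, exponent: Optional[int] = None) -> Iterator[int]:
    """Yield x = base + k*mod_a + rem_a (k < steps) that also satisfy x % mod_b == rem_b.

    Assumption: when an exponent is given, the scan starts at mod_a ** exponent.
    """
    base = mod_a ** exponent if exponent is not None else 0
    for k in range(steps):
        x = base + k * mod_a + rem_a
        if x % mod_b == rem_b:      # cheap pre-test before any signature check
            yield x


# A hypothetical file number with remainder 12333 modulo one million.
x = 7 * 10**6 + 12333
survivors = list(dmod_candidates(12333, 10**6, x % 123332, 123332, steps=100))

# The same idea with the scan starting at (10**6) ** 23.
y = (10**6) ** 23 + 5 * 10**6 + 12333
first_large = next(dmod_candidates(12333, 10**6, y % 123332, 123332,
                                   steps=100, exponent=23))
```

Only the surviving candidates would then be passed to the more expensive signature checks.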

ELEMENT MODULUSEXPONENT
Model

<!ELEMENT MODULUSEXPONENT (#PCDATA)>
Description

The element MODULUSEXPONENT represents the modulus scan constraint over
which the modulus scan or skip scan should iterate. This is a constraint to minimize the
amount of iterations to find a block of data associated with a package of checksums or
signatures. Typically this represents a bottom up scan but can also represent a top down or
reverse scan with the ceiling as the starting point in a search for matches to a given package
of attributes for a file or block. A file and the associated bytes are converted to a big integer
or number and then the filenumber takes the modulus of n to generate the remainder. The
modulus can be raised to a power or exponent generated by the logarithm of the file number
big integer and the modulus base to find an exponent to which the modulus can be iterated +
the remainder (i.e., (modulus ^ 1200th power) + (modulus * 1000) + remainder) to reduce
the number of iterations to find a block.

Usage

<BILLIONMODULUS>12333<MODULUSEXPONENT>122</MODULUSEXPONENT></BILLIONMODULUS>

ELEMENT MODULUSMULTIPLE
Model

<!ELEMENT MODULUSMULTIPLE (#PCDATA)>
Description

The element MODULUSMULTIPLE represents the modulus scan constraint over
which the modulus scan or skip scan should iterate. This is a constraint to minimize the
amount of iterations to find a block of data associated with a package of checksums or
signatures. Typically this represents a bottom up scan but can also represent a top down or
reverse scan with the ceiling as the starting point in a search for matches to a given package
of attributes for a file or block. A file and the associated bytes are converted to a big integer
or number and then the filenumber takes the modulus of n to generate the remainder. The
modulus can be multiplied by the modulus multiple (i.e., 1000 * modulo) to reduce the number
of iterations to find a block.

Usage

<MODULUSMULTIPLE>1000</MODULUSMULTIPLE>

This is a second usage example. It represents an alternative usage where the modulus of a
number is one hundred, the remainder is 33, the modulus of one hundred can be raised
to an exponent of 122, and it has a multiple of 1000. This forms the problem: there is a number
whose remainder when divided by a hundred is 33, whose modulus exponent is 122, and whose
modulus multiple is 1000; find the number. The modulus multiple is separate from the
exponent, so it forms the equation 100^122 + 100*1000.

<HUNDREDMODULUS>33<MODULUSEXPONENT>122</MODULUSEXPONENT>
<MODULUSMULTIPLE>1000</MODULUSMULTIPLE>
</HUNDREDMODULUS>
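
The arithmetic behind the exponent and multiple tags can be sketched as follows. The sketch assumes that the exponent is the integer logarithm of the file number in the modulus base, and that the scan starts at modulus^exponent + modulus*multiple + remainder, as the description suggests. Both function names are hypothetical.

```python
def modulus_exponent(file_number: int, modulus: int) -> int:
    """Largest e such that modulus ** e <= file_number (integer logarithm).

    Expects file_number >= 1 and modulus >= 2.
    """
    e = 0
    power = modulus
    while power <= file_number:
        power *= modulus
        e += 1
    return e


def scan_start(modulus: int, exponent: int, multiple: int, remainder: int) -> int:
    """Starting point of the scan: modulus**exponent + modulus*multiple + remainder."""
    return modulus ** exponent + modulus * multiple + remainder


# Values taken from the HUNDREDMODULUS usage example.
start = scan_start(100, 122, 1000, 33)
exponent = modulus_exponent(start, 100)
```

Because modulus^exponent and modulus*multiple are both multiples of the modulus, the starting point keeps the stored remainder, so stepping by the modulus from there visits only valid candidates.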



ELEMENT NUM
Model

<!ELEMENT NUM (#PCDATA)>
Description

The element NUM represents a hex output of some of the bytes of a file or a general
number to define the modulus number.

Usage

<NUM>100000</NUM>
<MODULUS><NUM>100^^ 

ELEMENT CRC
Model

<!ELEMENT CRC (#PCDATA)>
Description

The element CRC represents a CRC checksum.
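
A CRC value for the CRC element could be produced as sketched below. The text does not say which CRC variant or encoding is intended, so the use of CRC-32 from Python's zlib module and the eight-digit lowercase hex encoding are assumptions.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC-32 of the data as eight lowercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def crc_tag(data: bytes) -> str:
    """Wrap the checksum in CRC open and close tags."""
    return "<CRC>%s</CRC>" % crc32_hex(data)


entry = crc_tag(b"123456789")   # standard CRC-32 check input
```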

Usage 

Below is a basic checksum digital signature XML archive DTD.
Compress2.dtd

<!ELEMENT ARCH (DIRECTORY?, FILE+)>
<!ATTLIST ARCH name CDATA #IMPLIED>

<!ELEMENT DIRECTORY (NAME?, PASSWORD?, FILE+)>
<!ATTLIST DIRECTORY N CDATA #IMPLIED>
<!ATTLIST DIRECTORY S CDATA #IMPLIED>

<!ELEMENT DIR (NAME?, PASSWORD?, FILE+)>
<!ATTLIST DIR N CDATA #IMPLIED>
<!ATTLIST DIR S CDATA #IMPLIED>

<!--Element File: "<FILE N="Filename" S="1232233">...</FILE>" -->
<!--Filename or Filesize can be specified as attribute or element tag -->
<!ELEMENT FILE (NAME?, SIZE?, TIME?, BLOCK*, OS?, BYTEORDER?, PASSWORD?,
CRC?, MD?, MDR?, MDC?, MDX?, SHA, SHA_REVERSE?, MD2?, MD2_REVERSE?, MD3?,
MD3_REVERSE?, MD4?, MD4_REVERSE?, MD5?, MD5_REVERSE?, COLLISION_NUMBER?,
FLOOR?, CEIL?)>
<!--File Name attribute N -->
<!ATTLIST FILE N CDATA #IMPLIED>
<!--File Size in Bytes attribute S -->
<!ATTLIST FILE S CDATA #IMPLIED>

<!--Element Name is the File Name -->
<!ELEMENT NAME (#PCDATA)>
<!--Element OS is the Operating System Type -->
<!ELEMENT OS (#PCDATA)>
<!--Element BYTEORDER is the BYTEORDER of the computer -->
<!ELEMENT BYTEORDER (#PCDATA)>
<!--Element Password of file -->
<!ELEMENT PASSWORD (#PCDATA)>

<!ELEMENT BLOCK (#PCDATA)>
<!ATTLIST BLOCK NUM CDATA #REQUIRED>
<!ATTLIST BLOCK LNG CDATA #REQUIRED>

<!--File Size Bytes element -->
<!ELEMENT SIZE (#PCDATA)>
<!--File Signature Time -->
<!ELEMENT TIME (#PCDATA)>

<!--ELEMENT MD: User Defined digest:
attribute t = Message Digest type
attribute l = Message Digest length -->
<!ELEMENT MD (#PCDATA)>
<!ATTLIST MD t CDATA #REQUIRED>
<!ATTLIST MD l CDATA #REQUIRED>

<!--ELEMENT MDR: User Defined digest of reversed input:
attribute t = Message Digest type
attribute l = Message Digest length -->
<!ELEMENT MDR (#PCDATA)>
<!ATTLIST MDR t CDATA #REQUIRED>
<!ATTLIST MDR l CDATA #REQUIRED>

<!--ELEMENT MDX: User Defined digest of input: -->
<!--Examples: "<MDX>SHA:160:54173139a00072fdfa3988f1b8cf0e4e9baf31ee</MDX>" -->
<!ELEMENT MDX (#PCDATA)>

<!--ELEMENT MDC: Chopped sub hash of input
A message digest is run on an input and then chopped.
The message digest is chopped at n bytes.
So a 20 byte SHA digest can be chopped for small input
files to create variable length hashes.
An example SHA "<SHA>54173139a00072fdfa3988f1b8cf0e4e9baf31ee</SHA>"
would be chopped to 2 bytes with the markup
"<MDC t="SHA" l="2">54</MDC>" -->
<!ELEMENT MDC (#PCDATA)>
<!ATTLIST MDC t CDATA #REQUIRED>
<!ATTLIST MDC l CDATA #REQUIRED>

<!--ELEMENT SHA: Secure hash Algorithm -->
<!ELEMENT SHA (#PCDATA)>
<!ELEMENT SHA_REVERSE (#PCDATA)>

<!--ELEMENT MDx: Message Digest Algorithm -->
<!ELEMENT MD2 (#PCDATA)>
<!ELEMENT MD2_REVERSE (#PCDATA)>
<!ELEMENT MD3 (#PCDATA)>
<!ELEMENT MD3_REVERSE (#PCDATA)>
<!ELEMENT MD4 (#PCDATA)>
<!ELEMENT MD4_REVERSE (#PCDATA)>
<!ELEMENT MD5 (#PCDATA)>
<!ELEMENT MD5_REVERSE (#PCDATA)>

<!--ELEMENT Collision Number:
A collision occurs when there is a message digest in which
two inputs produce the same output digest list.
A collision number can be used to differentiate collisions.
The first input that produces the digest is collision number one.
Successive collisions increment the number by one.
The 12th collision sets the collision number to 12.
The markup would be "<collision_number>12</collision_number>"
-->
<!ELEMENT COLLISION_NUMBER (#PCDATA)>

<!ELEMENT FLOOR (NUM?, FACT?)>
<!ELEMENT CEIL (NUM?, FACT?)>
<!ELEMENT FACT (#PCDATA)>
<!ELEMENT SHAMODULUS (#PCDATA)>
<!ELEMENT MD5MODULUS (#PCDATA)>
<!ELEMENT MODULUS (#PCDATA)>
<!ATTLIST MODULUS n CDATA #IMPLIED>
<!ELEMENT HUNDREDMODULUS (#PCDATA)>
<!ELEMENT THOUSANDMODULUS (#PCDATA)>
<!ELEMENT MILLIONMODULUS (#PCDATA)>
<!ELEMENT BILLIONMODULUS (#PCDATA)>
<!ELEMENT DMOD (#PCDATA)>
<!ELEMENT MODULUSEXPONENT (#PCDATA)>
<!ELEMENT MODULUSMULTIPLE (#PCDATA)>
<!ELEMENT NUM (#PCDATA)>
<!ELEMENT CRC (#PCDATA)>
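
A data entry following the element order of the FILE content model could be generated as sketched below. This is a sketch under assumptions, not the implementation of the specification: it uses Python's hashlib and xml.etree.ElementTree, maps SHA to SHA-1, takes "reverse" to mean the byte-reversed input, and leaves FLOOR and CEIL empty. The function name build_entry and the 12-byte sample content are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_entry(name: str, data: bytes) -> ET.Element:
    """Build an ARCH element holding one FILE entry for the given bytes."""
    arch = ET.Element("ARCH")
    file_el = ET.SubElement(arch, "FILE")
    ET.SubElement(file_el, "NAME").text = name
    ET.SubElement(file_el, "SIZE").text = str(len(data))
    # Forward and reverse digests, in the order used by the content model.
    for tag, algorithm in (("SHA", "sha1"), ("MD5", "md5")):
        ET.SubElement(file_el, tag).text = hashlib.new(algorithm, data).hexdigest()
        ET.SubElement(file_el, tag + "_REVERSE").text = (
            hashlib.new(algorithm, data[::-1]).hexdigest())
    ET.SubElement(file_el, "FLOOR")
    ET.SubElement(file_el, "CEIL")
    return arch


# Hypothetical 12-byte palindromic content; the real file content is not given.
xml_text = ET.tostring(build_entry("palindrome.txt", b"abcdeffedcba"),
                       encoding="unicode")
```

The entry stores only the name, the size, and the signatures. The data itself is omitted and must be found again by a scan that reproduces the same signatures.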



EXAMPLES

This is a list of markup examples in the compress2.dtd format.

Example 1: palindrome.xml

Example 1 demonstrates a file named palindrome.txt, whose size is 12 bytes, followed
by the SHA digest, SHA reverse digest, Ripe 160 digest and Ripe 160 reverse digest
enclosed in tags. The md tag represents a message digest and the t attribute represents
the type. The floor and ceiling are empty. The file is well formed and enclosed within the
arch tag.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
  <FILE>
    <NAME>palindrome.txt</NAME>
    <SIZE>12</SIZE>
    <md t="SHA">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</md>
    <md t="SHA_REVERSE">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</md>
    <md t="ripe160">b7bda536ef319629b87b1a564678907834bdabae</md>
    <md t="ripe160_reverse">b7bda536ef319629b87b1a564678907834bdabae</md>
    <md t="MD5">86e444818581edddef062ad4ddcd00dd</md>
    <md t="MD4">ae93876f99f0013b969313ee5483c051</md>
    <md t="MD2">141ecbaa14a70771003b4b6973522c14</md>
    <FLOOR></FLOOR>
    <CEIL></CEIL>
  </FILE>
</ARCH>

Example 2:

This is an XML example of the directory markup for an archive. Multiple files can
exist within a directory, and files can exist outside of a directory in separate markup.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
  <DIR n="D:\Programs\Perl\">
    <FILE>
      <NAME>D:\Programs\Perl\bigfloat.pl</NAME>
      <SIZE>1149</SIZE>
      <md t="SHA">fed8cf9db1ad882e89c8987c6dcd435e98d767b3</md>
      <md t="SHA_REVERSE">33417958ea18546542dd51c4bd9986e5d5da9d74</md>
      <md t="MD5">71101f560d42112c4a0780bcd5051a9</md>
      <md t="MD4">7ba62b83cb30209158db3e97694M863</md>
      <md t="MD2">e5419e9a7124852e9fa9fa9004ceabbc</md>
      <FLOOR></FLOOR>
      <CEIL></CEIL>
    </FILE>
  </DIR>
  <FILE>
    <NAME>D:\Programs\Perl\compress.dtd</NAME>
    <SIZE>363</SIZE>
    <md t="SHA">42e7e84866aadf4ce03f0d962f1T62ee658791bb</md>
    <md t="SHA_REVERSE">5c0e8287cfca132b0a84063f8d7cc5c61ed4589a</md>
    <md t="MD5">730faf167f1b3c36e47ae1ec0cb74e19</md>
    <md t="MD4">426018e86d668ecffc0874c6f63c9ed2</md>
    <md t="MD2">bfef9fdb02d3f509bf82710ca0fa233a</md>
    <FLOOR></FLOOR>
    <CEIL></CEIL>
  </FILE>
</ARCH>



Example 3

Example 3 demonstrates the use of the collision number.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH name="">
  <FILE>
    <NAME>out</NAME>
    <SIZE>3961651</SIZE>
    <SHA>54173139a00072fdfa3988f1b8cf0e4e9baf31ee</SHA>
    <SHA_REVERSE>5563965239ce4ae6e66b23ed68afcdb83235577b</SHA_REVERSE>
    <MD5>f11ef1dfe3815469a41d8ec29157d32c</MD5>
    <MD4>e0f9a130b5ea1256d8c75126d26a6179</MD4>
    <MD2>26080751c1200a69978fdad60f886nf</MD2>
    <FLOOR></FLOOR>
    <CEIL></CEIL>
    <COLLISION_NUMBER>12</COLLISION_NUMBER>
  </FILE>
</ARCH>

Example 4

This example illustrates the bytes of a file converted to a big integer number, taken
modulus 1 million, and enclosed in the MILLIONMODULUS tag. There is also a floor and
ceiling number.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
  <FILE>
    <NAME>palindrome.txt</NAME>
    <SIZE>12</SIZE>
    <md t="SHA">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</md>
    <mdr t="SHA">d2d2922f9c0bea8ac448a2c67741eca8bba4a271</mdr>
    <md t="RIPE160">b7bda536ef319629b87b1a564678907834bdabae</md>
    <mdr t="RIPE160">b7bda536ef319629b87b1a564678907834bdabae</mdr>
    <md t="MD5">86e444818581edddef062ad4ddcd00dd</md>
    <md t="MD4">ae93876f99f0013b969313ee5483c051</md>
    <md t="MD2">141ecbaa14a70771003b4b6973522c14</md>
    <MILLIONMODULUS>+98097</MILLIONMODULUS>
    <FLOOR>+54</FLOOR>
    <CEIL>+55</CEIL>
  </FILE>
</ARCH>

Example 5

This example illustrates a chopped digest tag mdc where only 1, 2 or 3 bytes of the
digest for SHA, Ripe 160 or MD2 are used. The number of bytes used is expressed in the l
attribute.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress2.dtd">
<ARCH>
  <FILE>
    <NAME>palindrome.txt</NAME>
    <SIZE>12</SIZE>
    <mdc t="SHA" l="1">d2</mdc>
    <mdc t="RIPE160" l="3">b7bda5</mdc>
    <mdc t="MD2" l="2">141e</mdc>
    <MILLIONMODULUS>+98097</MILLIONMODULUS>
    <FLOOR>+54</FLOOR>
    <CEIL>+55</CEIL>
  </FILE>
</ARCH>
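
Producing a chopped digest tag can be sketched as follows. The sketch assumes that chopping keeps the leading bytes of the digest, which matches the values in the example above, and that the l attribute counts bytes rather than hex digits. SHA-1 from hashlib stands in for the SHA type, and the function names are hypothetical.

```python
import hashlib


def chopped_digest(data: bytes, algorithm: str, nbytes: int) -> str:
    """Hex encoding of the first nbytes of the digest of data."""
    return hashlib.new(algorithm, data).digest()[:nbytes].hex()


def mdc_tag(data: bytes, type_name: str, algorithm: str, nbytes: int) -> str:
    """Format an mdc element with type t and chopped length l (in bytes)."""
    return '<mdc t="%s" l="%d">%s</mdc>' % (
        type_name, nbytes, chopped_digest(data, algorithm, nbytes))


tag = mdc_tag(b"abc", "SHA", "sha1", 2)
```

A shorter chopped digest makes the entry smaller but admits more collisions, which is where the modulus, floor, ceiling and collision number constraints help.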

Example 6

This example illustrates the use of the file blocks for the file out. The file was divided
into two different blocks, 2,000,000 and 1,961,651 bytes long, and the bytes would then be
passed through the SHA and SHA Reverse digests. The digest values shown are simply
demonstration values and were not calculated.

<?xml version="1.0" encoding="ascii"?>
<!DOCTYPE COMPRESS_ARCH SYSTEM "compress1.dtd">
<ARCH>
  <FILE>
    <NAME>out</NAME>
    <SIZE>3961651</SIZE>
    <SHA>54173139a00072fdfa3988f1b8cf0e4e9baf31ee</SHA>
    <SHA_REVERSE>5563965239ce4ae6e66b23ed68afcdb83235577b</SHA_REVERSE>
    <MD5>f11ef1dfe3815469a41d8ec29157d32c</MD5>
    <MD4>e0f9a130b5ea1256d8c75126d26a6179</MD4>
    <MD2>26080751c1200a69978fdad60f886nf</MD2>
    <FLOOR></FLOOR>
    <CEIL></CEIL>
    <BLOCK num="1" lng="2000000">
      <SHA>11111100a00072fdfa3988f1b8cf0e4e9baf31ee</SHA>
      <SHA_REVERSE>0000111123ce4ae6e66b23ed68afcdb83235577b</SHA_REVERSE>
    </BLOCK>
    <BLOCK num="2" lng="1961651">
      <SHA>5c0e8287cfca132b0a84063f8d7cc5c61ed4589a</SHA>
      <SHA_REVERSE>123221213323223221123223232332323321111</SHA_REVERSE>
    </BLOCK>
  </FILE>
</ARCH>
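
Dividing a file into blocks and signing each block can be sketched as follows. This is a minimal sketch assuming fixed-size blocks with a shorter final block, SHA-1 as the digest, and byte reversal for the reverse digest. The function name and the dictionary keys, which mirror the num and lng attributes, are illustrative only.

```python
import hashlib
from typing import Dict, List, Union


def block_entries(data: bytes, block_size: int) -> List[Dict[str, Union[int, str]]]:
    """Split data into blocks and compute forward and reverse SHA-1 per block."""
    entries = []
    for num, start in enumerate(range(0, len(data), block_size), start=1):
        block = data[start:start + block_size]
        entries.append({
            "num": num,                                   # 1-based block number
            "lng": len(block),                            # block length in bytes
            "SHA": hashlib.sha1(block).hexdigest(),
            "SHA_REVERSE": hashlib.sha1(block[::-1]).hexdigest(),
        })
    return entries


sample = bytes(range(250)) * 4          # 1000 bytes of hypothetical file content
entries = block_entries(sample, 600)    # blocks of 600 and 400 bytes
```

Signing smaller blocks shrinks the search space for each block during recovery, at the cost of storing more signatures.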



While the present invention has been illustrated and described above regarding
various embodiments, it is not intended to be limited to the details shown, since various
modifications and structural changes may be made without departing in any way from the
spirit of the present invention. Without further analysis, the foregoing will so fully reveal the
gist of the present invention that others can, by applying current knowledge, readily adapt it
for various applications without omitting features that, from the standpoint of prior art, fairly
constitute essential characteristics of the generic or specific aspects of this invention.



CLAIMS 



1. A system for data storage comprising:

one or more processors operable to generate a first checksum value for a data block
and a second checksum value for the data block, wherein said first checksum value is
generated by applying a first checksum algorithm to said data block and said second
checksum value is generated by applying a second checksum algorithm, different
from said first checksum algorithm, to said data block;

one or more processors operable to create a data entry comprising data identifying:
the first and second checksum values, the first and second checksum algorithms, and
at least one of the identified attributes of the data block; and

one or more processors operable to store said data entry in a computer-readable
medium.

2. A system for data storage comprising:

one or more processors operable to identify one or more attributes of a first data block
and a second data block, said second data block comprising and different from said
first data block;

one or more processors operable to generate a first checksum value for the first data
block, wherein said first checksum value is generated by applying a first checksum
algorithm to said first data block;

one or more processors operable to generate a second checksum value for the second
data block, wherein said second checksum value is generated by applying a second
checksum algorithm to said second data block,

one or more processors operable to create a data entry comprising data identifying:
the first and second checksum values, and at least one of the identified attributes of
the first and second data blocks; and

one or more processors operable to store said data entry in a computer-readable
medium.

3. The system of claim 1 further comprising:

one or more processors are further operable to determine an attribute for the data
block, said attribute being one of a name, size, length, hash type, checksum type,
digest type, padding, floor, ceiling, modulus, collision, directory, root, drive, path,
date, time, modified date, permission, owner, or byte order,

one or more processors operable to create a data entry comprising the attribute; and

one or more processors operable to store said data entry in a computer-readable
medium.

4. The system of claim 1, wherein the second checksum algorithm is the first checksum
algorithm.

5. The system of claim 1, wherein the attributes comprise at least one of the following:
name, size, length, hash type, checksum type, digest type, padding, floor, ceiling,
modulus, collision, directory, root, drive, path, date, time, modified date, permission,
owner, and byte order.

6. The system of claim 1, wherein the data entry is written in a markup language.

7. The system of claim 6, wherein the markup language is one of either XML or SGML.

8. The system of claim 1, wherein the one or more checksum values is at least one of: a
hashed value, a digest, and a checksum number.

9. The system of claim 1, wherein the one or more checksum values is generated using at
least one of an MD2 algorithm, an MD4 algorithm, an MD5 algorithm, an SHA
algorithm, a Cyclic Redundant Checksum algorithm, a Ripe algorithm, a CRC16
checksum algorithm, a CRC32 checksum algorithm, and a CRC64 checksum algorithm.

10. The system of claim 1, wherein at least 2 of said one or more processors operates in
parallel.

11. A system for data recovery comprising:

one or more processors operable to receive a data entry comprising data identifying:
first and second checksum values, first and second checksum algorithms, and at least
one attribute of a first data block; and based on said data entry;

one or more processors operable to identify said first data block by:

blocks;

(b) comparing said first set of checksum values to said first checksum value,

(c) identifying one or more first candidate data blocks as potentially being said
first data block.

12. The system of claim 11, further comprising one or more processors operable to identify
one or more first candidate data blocks as corresponding to values in said first set of
checksum values that are equal to said first checksum value.

13. The system of claim 11, further comprising:

one or more processors operable to generate a second set of checksum values by applying
said second checksum algorithm to said first candidate data blocks;

one or more processors operable to compare said second set of checksum values to said
second checksum value;

one or more processors operable to identify a second set of candidate data blocks as
corresponding to values in said second set of checksum values equal to said second
checksum value; and

one or more processors operable to identify all data blocks in said second set of candidate
data blocks as potentially being said first data block.

14. The system of claim 11, wherein the first checksum algorithm is applied to selected data
blocks in the first set of data blocks through one of at least a linear scan or nonlinear scan.

15. The system of claim 14, wherein the nonlinear scan comprises one of a skipping scan, a
modulus scan, or an exponential scan.

16. The system of claim 11, wherein each candidate data block is assigned a unique collision
number.

17. The system of claim 11, wherein at least one of the one or more processors comprises an
integer calculation unit and one or more hash registers.

18. A system for data storage comprising: 

computer implemented means for generating a first checksum value for a first data 
block and a second checksum value for the first data block; 

computer implemented means for creating a data entry comprising the first and
second checksum values; and

computer implemented means for storing said data entry in a computer-readable 
medium. 

19. A system for data storage comprising: 

computer implemented means for identifying one or more attributes of a data block; 

computer implemented means for generating a first checksum value for the data block 
and a second checksum value for the data block, wherein said first checksum value is 
generated by applying a first checksum algorithm to said data block and said second 
checksum value is generated by applying a second checksum algorithm, different 
from said first checksum algorithm, to said data block; 

computer implemented means for creating a data entry comprising data identifying: 
the first and second checksum values, the first and second checksum algorithms, and 
at least one of the identified attributes of the data block; and 

computer implemented means for storing said data entry in a computer-readable 
medium. 

20. A system for data recovery comprising: 

computer implemented means for identifying one or more attributes of a first data 
block and a second data block, said second data block comprising and different from 
said first data block; 

computer implemented means for generating a first checksum value for the first data 
block, wherein said first checksum value is generated by applying a first checksum 
algorithm to said first data block; 

computer implemented means for generating a second checksum value for the second 
data block, wherein said second checksum value is generated by applying a second 
checksum algorithm to said second data block, 

computer implemented means for creating a data entry comprising data identifying:
the first and second checksum values, and at least one of the identified attributes of
the first and second data blocks; and

computer implemented means for storing said data entry in a computer-readable
medium.